1. Introduction

Markov chain Monte Carlo is a commonly used approach to evaluating expectations
of the form $\theta := \int_{\mathcal{X}} f(x)\,\pi(dx)$, where $\pi$ is an intractable probability
measure, e.g. known up to a normalising constant. One simulates $(X_n)_{n \ge 0}$, an
ergodic Markov chain on $\mathcal{X}$, evolving according to a transition kernel $P$ with
stationary limiting distribution $\pi$ and, typically, takes the ergodic average as an
estimate of $\theta$. The approach is justified by asymptotic Markov chain theory,
see e.g. [33, 43]. Metropolis algorithms and Gibbs samplers (to be described in
Section 2) are among the most common MCMC algorithms, c.f. [36, 29, 43].
The quality of an estimate produced by an MCMC algorithm depends on
probabilistic properties of the underlying Markov chain. Designing an appropri- 
ate transition kernel P that guarantees rapid convergence to stationarity and 
efficient simulation is often a challenging task, especially in high dimensions. 
For Metropolis algorithms there are various optimal scaling results [37l EH El 
[TOl m |42l |43l |46] which provide "prescriptions" of how to do this, though they 
typically depend on unknown characteristics of $\pi$.

For random scan Gibbs samplers, a further design decision is choosing the 
selection probabilities (i.e., coordinate weightings) which will be used to select 
which coordinate to update next. These are usually chosen to be uniform, but 
some recent work [30l ESI [27l [TBI EHl [E] has suggested that non-uniform
weightings may sometimes be preferable.

For a very simple toy example to illustrate this issue, suppose
$\mathcal{X} = [0, 1] \times [-100, 100]$, with $\pi(x_1, x_2) \propto x_1^{100}(1 + \sin(x_2))$. Then with respect
to $x_1$, this $\pi$ puts almost all of the mass right up against the line $x_1 = 1$. Thus,
repeated Gibbs sampler updates of the coordinate $x_1$ provide virtually no help in
exploring the state space, and do not need to be done often at all (unless the
functional $f$ of interest is extremely sensitive to tiny changes in $x_1$). By contrast,
with respect to $x_2$, this $\pi$ is a highly multi-modal density with wide support and
many peaks and valleys, requiring many updates to the coordinate $x_2$ in order to
explore the state space appropriately. (Of course, as with any Gibbs sampler,
repeatedly updating one coordinate does not help with distributional convergence,
it only helps with sampling the entire state space to produce good estimates.) Thus,
an efficient Gibbs sampler for this example would not update each of $x_1$ and
$x_2$ equally often; rather, it would update $x_2$ very often and $x_1$ hardly at all.
Of course, in this simple example, it is easy to see directly that $x_1$ should be
updated less than $x_2$, and furthermore such efficiencies would only improve
the sampler by approximately a factor of 2. However, in a high-dimensional
example (c.f. [12]), such issues could be much more significant, and also much
more difficult to detect manually.
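
As a rough check of the claim that the mass sits against the line $x_1 = 1$, the following worked calculation takes the toy density to be $\pi(x_1, x_2) \propto x_1^{100}(1 + \sin(x_2))$ (the exponent 100 is our reading of the formula and should be treated as an assumption) and computes the marginal probability that $x_1$ exceeds 0.95, an arbitrary illustrative threshold. Since the density factorises, the marginal of $x_1$ is proportional to $x_1^{100}$ on $[0, 1]$.

```latex
\pi_1(x_1) = 101\, x_1^{100}, \quad x_1 \in [0,1],
\qquad
\Pr(X_1 > 0.95) = \int_{0.95}^{1} 101\, x_1^{100}\, dx_1
                = 1 - 0.95^{101} \approx 0.994 .
```

Under this assumption, a Gibbs update of $x_1$ lands in $(0.95, 1]$ more than 99% of the time regardless of $x_2$, which is why such updates contribute little to exploring the state space.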

One promising avenue to address this challenge is adaptive MCMC algorithms.
As an MCMC simulation progresses, more and more information about
the target distribution $\pi$ is learned. Adaptive MCMC attempts to use this new
information to redesign the transition kernel $P$ on the fly, based on the current
simulation output. That is, the transition kernel $P_n$ used for obtaining $X_n | X_{n-1}$
may depend on $\{X_0, \dots, X_{n-1}\}$. So, in the above toy example, a good adaptive
Gibbs sampler would somehow automatically "learn" to update $x_1$ less often,
without requiring the user to determine this manually (which could be difficult
or impossible in a very high-dimensional problem).

Unfortunately, such adaptive algorithms are only valid if their ergodicity can
be established. The stochastic process $(X_n)_{n \ge 0}$ for an adaptive algorithm is no
longer a Markov chain; the potential benefit of adaptive MCMC comes at the
price of requiring more sophisticated theoretical analysis. There is substantial 
and rapidly growing literature on both theory and practice of adaptive MCMC 
(see e.g. [B [II ffl [101 [II SI 113 123 [Si [HI [II |H1 II H SZl Ei 1 which 
includes counterintuitive examples where $X_n$ fails to converge to the desired
distribution $\pi$ (c.f. [5j |44j [3 [25] ), as well as many results guaranteeing ergodicity
under various assumptions. Most of the previous work on ergodicity of adaptive 
MCMC has concentrated on adapting Metropolis and related algorithms, with 
less attention paid to ergodicity when adapting the selection probabilities for 
random scan Gibbs samplers. 

Motivated by such considerations, in the present paper we study the ergod- 
icity of various types of adaptive Gibbs samplers. To our knowledge, proofs of 
ergodicity for adaptively-weighted Gibbs samplers have previously been considered
only by [28], and we shall provide a counter-example below (Example 3.1)
to demonstrate that their main result is not correct. In view of this, we are not
aware of any valid ergodicity results in the literature that consider adapting 
selection probabilities of random scan Gibbs samplers, and we attempt to fill 
that gap herein. 

This paper is organised as follows. We begin in Section 2 with basic definitions.
In Section 3 we present a cautionary Example 3.1, where a seemingly
ergodic adaptive Gibbs sampler is in fact transient (as we prove formally later
in Section 6) and provides a counter-example to Theorem 2.1 of [28]. Next,
we establish various positive results for ergodicity of adaptive Gibbs samplers.
We consider adaptive random scan Gibbs samplers (AdapRSG) which update
coordinate selection probabilities as the simulation progresses; adaptive random
scan Metropolis-within-Gibbs samplers (AdapRSMwG) which update coordinate
selection probabilities as the simulation progresses; and adaptive random scan
adaptive Metropolis-within-Gibbs samplers (AdapRSadapMwG) that update
coordinate selection probabilities as well as proposal distributions for the Metropolis
steps. Positive results in the uniform setting are discussed in Section 4, whereas
Section 5 deals with the nonuniform setting. In each case, we prove that under
reasonably mild conditions, the adaptive Gibbs samplers are guaranteed to be
ergodic, although our cautionary example does show that it is important to
verify some conditions before applying such algorithms.

2. Preliminaries 

Gibbs samplers are commonly used MCMC algorithms for sampling from complicated
high-dimensional probability distributions $\pi$ in cases where the full
conditional distributions of $\pi$ are easy to sample from. To define them, let
$(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ be a $d$-dimensional state space where $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_d$ and write
$X_n \in \mathcal{X}$ as $X_n = (X_{n,1}, \dots, X_{n,d})$. We shall use the shorthand notation

$$ X_{n,-i} := (X_{n,1}, \dots, X_{n,i-1}, X_{n,i+1}, \dots, X_{n,d}), $$

and similarly $\mathcal{X}_{-i} = \mathcal{X}_1 \times \dots \times \mathcal{X}_{i-1} \times \mathcal{X}_{i+1} \times \dots \times \mathcal{X}_d$.

Let $\pi(\cdot|x_{-i})$ denote the conditional distribution of $Z_i | Z_{-i} = x_{-i}$ where
$Z \sim \pi$. The random scan Gibbs sampler draws $X_n$ given $X_{n-1}$ (iteratively
for $n = 1, 2, 3, \dots$) by first choosing one coordinate at random according to
some selection probabilities $\alpha = (\alpha_1, \dots, \alpha_d)$ (e.g. uniformly), and then updating
that coordinate by a draw from its conditional distribution. More precisely,
the Gibbs sampler transition kernel $P = P_\alpha$ is the result of performing the
following three steps.

Algorithm 2.1 (RSG($\alpha$)).

1. Choose coordinate $i \in \{1, \dots, d\}$ according to selection probabilities $\alpha$, i.e.
with $P(i = j) = \alpha_j$

2. Draw $Y \sim \pi(\cdot|X_{n-1,-i})$

3. Set $X_n := (X_{n-1,1}, \dots, X_{n-1,i-1}, Y, X_{n-1,i+1}, \dots, X_{n-1,d})$.
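
The three steps of Algorithm 2.1 can be sketched in a few lines of Python. This is a minimal illustration, assuming a finite state space on which the target is supplied as a dictionary of unnormalised masses; the names `rsg_step` and `run_rsg` and the dictionary representation are our own hypothetical choices, not part of the algorithm's definition.

```python
import random

def rsg_step(x, pi, alpha, rng):
    """One RSG(alpha) transition on a finite state space.

    x     : current state, a tuple of length d
    pi    : dict mapping every state tuple to its (unnormalised) mass
    alpha : selection probabilities (alpha_1, ..., alpha_d)
    rng   : a random.Random instance
    """
    d = len(x)
    # Step 1: choose coordinate i with P(i = j) = alpha_j.
    i = rng.choices(range(d), weights=alpha)[0]
    # Step 2: draw Y from pi(. | x_{-i}); the candidates are the states that
    # agree with x in every coordinate except possibly i.
    slice_states = [s for s in pi
                    if all(s[k] == x[k] for k in range(d) if k != i)]
    slice_mass = [pi[s] for s in slice_states]
    # Step 3: the drawn state is x with coordinate i replaced by Y.
    return rng.choices(slice_states, weights=slice_mass)[0]

def run_rsg(pi, x0, alpha, n_steps, seed=0):
    """Run the chain for n_steps and return the visited states."""
    rng = random.Random(seed)
    chain, x = [], x0
    for _ in range(n_steps):
        x = rsg_step(x, pi, alpha, rng)
        chain.append(x)
    return chain
```

For example, `run_rsg({(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}, (0, 0), [0.3, 0.7], 40000)` produces a trajectory whose empirical frequencies should approach the normalised masses, whatever fixed $\alpha$ with positive entries is used.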

Whereas the standard approach is to choose the coordinate $i$ at the first
step uniformly at random, which corresponds to $\alpha = (1/d, \dots, 1/d)$, this may
be a substantial waste of simulation effort if $d$ is large and variability of
coordinates differs significantly. This has been discussed theoretically in [30] and
also observed empirically e.g. in Bayesian variable selection for linear models in
statistical genetics [48, 12].

Throughout the paper we denote the transition kernel of a random scan Gibbs
sampler with selection probabilities $\alpha$ as $P_\alpha$ and the transition kernel of a single
Gibbs update of coordinate $i$ is denoted as $P_i$, hence $P_\alpha = \sum_{i=1}^{d} \alpha_i P_i$.

We consider a class of adaptive random scan Gibbs samplers where selection
probabilities $\alpha = (\alpha_1, \dots, \alpha_d)$ are subject to optimization within some subset
$\mathcal{Y} \subseteq [0, 1]^d$ of possible choices. Therefore a single step of our generic adaptive
algorithm for drawing $X_n$ given the trajectory $X_{n-1}, \dots, X_0$, and current selection
probabilities $\alpha_{n-1} = (\alpha_{n-1,1}, \dots, \alpha_{n-1,d})$ amounts to the following steps,
where $R_n(\cdot)$ is some update rule for $\alpha_n$.

Algorithm 2.2 (AdapRSG).

1. Set $\alpha_n := R_n(\alpha_0, \dots, \alpha_{n-1}, X_{n-1}, \dots, X_0) \in \mathcal{Y}$

2. Choose coordinate $i \in \{1, \dots, d\}$ according to selection probabilities $\alpha_n$

3. Draw $Y \sim \pi(\cdot|X_{n-1,-i})$

4. Set $X_n := (X_{n-1,1}, \dots, X_{n-1,i-1}, Y, X_{n-1,i+1}, \dots, X_{n-1,d})$
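
To make the role of the update rule concrete, the following sketch implements Algorithm 2.2 on a finite state space with one hypothetical choice of $R_n$: selection weights proportional to how often each coordinate update has actually moved the chain, shifted so that every weight stays at least `eps`. This rule, the pseudo-counts, and all function names are illustrative assumptions of ours; the algorithm itself leaves $R_n$ unspecified.

```python
import random

def conditional_draw(x, i, pi, rng):
    """Draw from pi(. | x_{-i}): states agreeing with x off coordinate i."""
    states = [s for s in pi
              if all(s[k] == x[k] for k in range(len(x)) if k != i)]
    return rng.choices(states, weights=[pi[s] for s in states])[0]

def update_rule(move_counts, eps):
    """Hypothetical R_n: weights proportional to past move counts, mapped
    into {alpha : alpha_i >= eps, sum alpha_i = 1} (requires eps <= 1/d)."""
    d, total = len(move_counts), sum(move_counts)
    return [eps + (1 - d * eps) * c / total for c in move_counts]

def adap_rsg(pi, x0, n_steps, eps, rng):
    """Run AdapRSG; return the visited states and the alpha_n used."""
    x, d = x0, len(x0)
    move_counts = [1] * d            # pseudo-counts, so alpha starts uniform
    chain, alphas = [], []
    for _ in range(n_steps):
        alpha = update_rule(move_counts, eps)         # step 1
        i = rng.choices(range(d), weights=alpha)[0]   # step 2
        y = conditional_draw(x, i, pi, rng)           # steps 3 and 4
        if y != x:
            move_counts[i] += 1      # information fed back into R_n
        x = y
        chain.append(x)
        alphas.append(alpha)
    return chain, alphas
```

With this particular rule, $\alpha_n$ only changes when a move occurs, and each change is of order one over the total number of moves so far, so successive selection probabilities differ less and less as the run proceeds.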

Algorithm 2.2 defines $P_n$, the transition kernel used at time $n$, and $\alpha_n$ plays
here the role of $\Gamma_n$ in the more general adaptive setting of e.g. 011 [8]. Let
$\pi_n = \pi_n(x_0, \alpha_0)$ denote the distribution of $X_n$ induced by Algorithm 2.1 or 2.2
given starting values $x_0$ and $\alpha_0$, i.e. for $B \in \mathcal{B}(\mathcal{X})$,

$$ \pi_n(B) = \pi_n((x_0, \alpha_0), B) := P(X_n \in B \,|\, X_0 = x_0, \alpha_0). \tag{1} $$

Clearly if one uses Algorithm 2.1 then $\alpha_0 = \alpha$ remains fixed and
$\pi_n(x_0, \alpha)(B) = P_\alpha^n(x_0, B)$. By $\|\nu - \mu\|_{TV}$ denote the total variation distance
between probability measures $\nu$ and $\mu$. Let

$$ T(x_0, \alpha_0, n) := \|\pi_n(x_0, \alpha_0) - \pi\|_{TV}. \tag{2} $$

We call the adaptive Algorithm 2.2 ergodic if $T(x_0, \alpha_0, n) \to 0$ for $\pi$-almost
every starting state $x_0$ and all $\alpha_0 \in \mathcal{Y}$.
We shall also consider random scan Metropolis-within-Gibbs samplers that
instead of sampling from the full conditional at step (2) of Algorithm 2.1
(respectively at step (3) of Algorithm 2.2), perform a single Metropolis or
Metropolis-Hastings step [32, 22]. More precisely, given $X_{n-1,-i}$ the $i$-th coordinate
$X_{n-1,i}$ is updated by a draw $Y$ from the proposal distribution $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ with
the usual Metropolis acceptance probability for the marginal stationary
distribution $\pi(\cdot|X_{n-1,-i})$. Such Metropolis-within-Gibbs algorithms were originally
proposed by [32] and have been very widely used. Versions of this algorithm
which adapt the proposal distributions $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ were considered by
e.g. [101133], but always with fixed (usually uniform) coordinate selection
probabilities. If instead the proposal distributions $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ remain fixed,
but the selection probabilities are adapted on the fly, we obtain the following
algorithm (where $q_{x_{-i}}(x, y)$ is the density function for $Q_{x_{-i}}(x, \cdot)$).

Algorithm 2.3 (AdapRSMwG).

1. Set $\alpha_n := R_n(\alpha_0, \dots, \alpha_{n-1}, X_{n-1}, \dots, X_0) \in \mathcal{Y}$

2. Choose coordinate $i \in \{1, \dots, d\}$ according to selection probabilities $\alpha_n$

3. Draw $Y \sim Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$

4. With probability

$$ \min\left(1, \frac{\pi(Y|X_{n-1,-i})\; q_{X_{n-1,-i}}(Y, X_{n-1,i})}{\pi(X_{n-1,i}|X_{n-1,-i})\; q_{X_{n-1,-i}}(X_{n-1,i}, Y)}\right), \tag{3} $$

accept the proposal and set

$$ X_n = (X_{n-1,1}, \dots, X_{n-1,i-1}, Y, X_{n-1,i+1}, \dots, X_{n-1,d}); $$

otherwise, reject the proposal and set $X_n = X_{n-1}$.

Ergodicity of AdapRSMwG is considered in Sections 4.2 and 5 below. Of course,
if the proposal distribution $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ is symmetric about $X_{n-1}$, then
the $q$ factors in the acceptance probability (3) cancel out, and (3) reduces to
the simpler probability $\min\left(1, \pi(Y|X_{n-1,-i})/\pi(X_{n-1,i}|X_{n-1,-i})\right)$.
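
The acceptance probability (3) and a single Metropolis-within-Gibbs coordinate update can be sketched as follows. The sketch assumes a Gaussian random-walk proposal (a symmetric choice, so the $q$ factors cancel) and a target supplied as an unnormalised joint density; both are illustrative assumptions, as are the function names. It uses the fact that the ratio of conditionals $\pi(Y|x_{-i})/\pi(x_i|x_{-i})$ equals the ratio of joint densities at the two points, since the normalising factor depends only on $x_{-i}$.

```python
import math
import random

def mh_accept_prob(pi_y, pi_x, q_yx, q_xy):
    """Acceptance probability (3): min(1, pi(Y) q(Y, X) / (pi(X) q(X, Y))).

    pi_y, pi_x : target (conditional or joint) density at proposal / current
    q_yx       : proposal density of moving from Y back to X
    q_xy       : proposal density of moving from X to Y
    Assumes pi_x > 0 and q_xy > 0.
    """
    return min(1.0, (pi_y * q_yx) / (pi_x * q_xy))

def mwg_update(x, i, density, step, rng):
    """One Metropolis update of coordinate i with a Gaussian random walk.

    density : unnormalised joint density, called on a list of coordinates
    step    : standard deviation of the (hypothetical) proposal
    """
    y = list(x)
    y[i] = x[i] + rng.gauss(0.0, step)    # symmetric: q(x, y) = q(y, x)
    accept = mh_accept_prob(density(y), density(x), 1.0, 1.0)
    return y if rng.random() < accept else list(x)
```

Only coordinate `i` can change in `mwg_update`; embedding it in the loop of Algorithm 2.3 amounts to replacing the exact conditional draw by this accept/reject step.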

We shall also consider versions of the algorithm in which the proposal
distributions $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ are also chosen adaptively, from some family
$\{Q_{x_{-i},\gamma}\}_{\gamma \in \Gamma_i}$ with corresponding density functions $q_{x_{-i},\gamma}$, as in e.g. the
statistical genetics application [?S1 H^-. Versions of such algorithms with fixed
selection probabilities are considered by e.g. [50] and [3S]. They require additional
adaptation parameters $\gamma_{n,i}$ that are updated on the fly and are allowed to depend
on the past trajectories. More precisely, if $\gamma_n = (\gamma_{n,1}, \dots, \gamma_{n,d})$ and
$\mathcal{G}_n = \sigma\{X_0, \dots, X_n, \alpha_0, \dots, \alpha_n, \gamma_0, \dots, \gamma_n\}$, then the conditional distribution of $\gamma_n$
given $\mathcal{G}_{n-1}$ can be specified by the particular algorithm used, via a second
update function $R'_n$. If we combine such proposal distribution adaptions with
coordinate selection probability adaptions, this results in a doubly-adaptive
algorithm, as follows.

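Algorithm 2.4 (AdapRSadapMwG).

1. Set $\alpha_n := R_n(\alpha_0, \dots, \alpha_{n-1}, X_{n-1}, \dots, X_0, \gamma_{n-1}, \dots, \gamma_0) \in \mathcal{Y}$

2. Set $\gamma_n := R'_n(\alpha_0, \dots, \alpha_{n-1}, X_{n-1}, \dots, X_0, \gamma_{n-1}, \dots, \gamma_0) \in \Gamma_1 \times \dots \times \Gamma_d$

3. Choose coordinate $i \in \{1, \dots, d\}$ according to selection probabilities $\alpha_n$, i.e.
with $P(i = j) = \alpha_{n,j}$

4. Draw $Y \sim Q_{X_{n-1,-i},\,\gamma_{n-1,i}}(X_{n-1,i}, \cdot)$

5. With probability

$$ \min\left(1, \frac{\pi(Y|X_{n-1,-i})\; q_{X_{n-1,-i},\,\gamma_{n-1,i}}(Y, X_{n-1,i})}{\pi(X_{n-1,i}|X_{n-1,-i})\; q_{X_{n-1,-i},\,\gamma_{n-1,i}}(X_{n-1,i}, Y)}\right), $$

accept the proposal and set

$$ X_n = (X_{n-1,1}, \dots, X_{n-1,i-1}, Y, X_{n-1,i+1}, \dots, X_{n-1,d}); $$

otherwise, reject the proposal and set $X_n = X_{n-1}$.

Ergodicity of AdapRSadapMwG is considered in Sections 4.3 and 5 below.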

3. A counter-example 

Adaptive algorithms destroy the Markovian nature of $(X_n)_{n \ge 0}$, and are thus
notoriously difficult to analyse theoretically. In particular, it is easy to be tricked
into thinking that a simple adaptive algorithm "must" be ergodic when in fact
it is not.

For example, Theorem 2.1 of [28] states that ergodicity of adaptive Gibbs
samplers follows from the following two simple conditions:

(i) $\alpha_n \to \alpha$ a.s. for some fixed $\alpha \in (0, 1)^d$; and

(ii) The random scan Gibbs sampler with fixed selection probabilities $\alpha$
induces an ergodic Markov chain with stationary distribution $\pi$.

Unfortunately, this claim is false, i.e. (i) and (ii) alone do not guarantee
ergodicity, as the following example and proposition demonstrate. (It seems
that in the proof of Theorem 2.1 in [28], the same measure is used to represent
trajectories of the adaptive process and of a corresponding non-adaptive process,
which is not correct and thus leads to the error.)

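Example 3.1. Let $\mathbb{N} = \{1, 2, \dots\}$, and let the state space
$\mathcal{X} = \{(i, j) \in \mathbb{N} \times \mathbb{N} : i = j \text{ or } i = j + 1\}$, with target distribution given by
$\pi(i, j) \propto j^{-2}$. On $\mathcal{X}$, consider a class of adaptive random scan Gibbs samplers
for $\pi$, as defined by Algorithm 2.2, with update rule given by:

$$ R_n\big(\alpha_{n-1}, X_{n-1} = (i, j)\big) =
\begin{cases}
\left\{\tfrac{1}{2} + \tfrac{4}{a_n},\ \tfrac{1}{2} - \tfrac{4}{a_n}\right\} & \text{if } i = j, \\[4pt]
\left\{\tfrac{1}{2} - \tfrac{4}{a_n},\ \tfrac{1}{2} + \tfrac{4}{a_n}\right\} & \text{if } i = j + 1,
\end{cases} \tag{4} $$

for some choice of the sequence $(a_n)_{n=0}^{\infty}$ satisfying $8 < a_n \nearrow \infty$.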
Example 3.1 satisfies assumptions (i) and (ii) above. Indeed, (i) clearly holds
since $\alpha_n \to \alpha := (\tfrac{1}{2}, \tfrac{1}{2})$, and (ii) follows immediately from the standard Markov
chain properties of irreducibility and aperiodicity (c.f. [SSlH^). However, if $a_n$
increases to $\infty$ slowly enough, then the example exhibits transient behaviour
and is not ergodic. More precisely, we shall prove the following:

Proposition 3.2. There exists a choice of the $(a_n)$ for which the process $(X_n)_{n \ge 0}$
defined in Example 3.1 is not ergodic. Specifically, starting at $X_0 = (1, 1)$, we
have $P(X_{n,1} \to \infty) > 0$, i.e. the process exhibits transient behaviour with positive
probability, so it does not converge in distribution to any probability measure on
$\mathcal{X}$. In particular, $\|\pi_n - \pi\|_{TV} \nrightarrow 0$.

Remark 3.3. In fact, we believe that in Proposition 3.2 $P(X_{n,1} \to \infty) = 1$,
though to reduce technicalities we only prove that $P(X_{n,1} \to \infty) > 0$, which is
sufficient to establish non-ergodicity.

A detailed proof of Proposition 3.2 is presented in Section 6. We also simulated
Example 3.1 on a computer (with the $(a_n)$ as defined in Section 6), resulting
in the following trace plot of $X_{n,1}$ which illustrates the transient behaviour
since $X_{n,1}$ increases quickly and steadily as a function of $n$:

[Figure: trace plot of $X_{n,1}$ against $n$.]

4. Ergodicity - the uniform case

We now present positive results about ergodicity of adaptive Gibbs samplers
under various assumptions. Results of this section are specific to uniformly ergodic
chains. (Recall that a Markov chain with transition kernel $P$ is uniformly ergodic
if there exist $M < \infty$ and $\rho < 1$ s.t. $\|P^n(x, \cdot) - \pi(\cdot)\|_{TV} \le M\rho^n$ for every $x \in \mathcal{X}$;
see e.g. [33, 43] for this and other notions related to general state space Markov
chains.) In some sense this is a severe restriction, since most MCMC algorithms
arising in statistical applications are not uniformly ergodic. However, truncating
the variables involved at some (very large) value is usually sufficient to ensure
uniform ergodicity without affecting the statistical conclusions in any practical
sense, so the results of this section may be sufficient for a pragmatic user. The
nonuniform case is considered in the following Section 5.

To continue, recall that RSG($\alpha$) stands for random scan Gibbs sampler with
selection probabilities $\alpha$ as defined by Algorithm 2.1 and AdapRSG is the adaptive
version as defined by Algorithm 2.2. For notation, let
$\Delta_{d-1} := \{(p_1, \dots, p_d) \in \mathbb{R}^d : p_i \ge 0, \sum_{i=1}^{d} p_i = 1\}$ be the $(d-1)$-dimensional probability
simplex, and let

$$ \mathcal{Y} := [\varepsilon, 1]^d \cap \Delta_{d-1} \tag{5} $$

for some $0 < \varepsilon \le 1/d$. We shall assume that all our selection probabilities are
in this set $\mathcal{Y}$.

Remark 4.1. The above assumption may seem constraining; it is, however,
irrelevant in practice. The additional computational effort on top of the unknown
optimal strategy $\alpha^*$ (that may be in $\Delta_{d-1} \setminus \mathcal{Y}$) is easily controlled by setting
$\varepsilon := (Kd)^{-1}$, which effectively upper-bounds it by $1/K$. The argument can be
easily made rigorous e.g. in terms of the total variation distance or the asymptotic
variance.

4.1. Adaptive random scan Gibbs samplers

The main result of this section is the following. 

Theorem 4.2. Let the selection probabilities $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$ as in
(5). Assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$.

(b) there exists $\beta \in \mathcal{Y}$ s.t. RSG($\beta$) is uniformly ergodic.

Then AdapRSG is ergodic, i.e.

$$ T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{6} $$

Moreover, if

(a') $\sup_{x_0, \alpha_0} |\alpha_n - \alpha_{n-1}| \to 0$ in probability,

then convergence of AdapRSG is also uniform over all $x_0, \alpha_0$, i.e.

$$ \sup_{x_0, \alpha_0} T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{7} $$



Remark 4.3. 1. Assumption (b) will typically be verified for $\beta = (1/d, \dots, 1/d)$,
see also Proposition 4.8 below.

2. We expect that most adaptive random scan Gibbs samplers will be designed
so that $|\alpha_n - \alpha_{n-1}| \le a_n$ for every $n \ge 1$, $x_0 \in \mathcal{X}$, $\alpha_0 \in \mathcal{Y}$, and
$\omega \in \Omega$, for some deterministic sequence $a_n \to 0$ (which holds for e.g.
the adaptations considered in [H]). In such cases, (a') is automatically
satisfied.

3. The sequence $\alpha_n$ is not required to converge, and in particular the amount
of adaptation, i.e. $\sum_{n=1}^{\infty} |\alpha_n - \alpha_{n-1}|$, is allowed to be infinite.

4. In Example 3.1 condition (a') is satisfied but condition (b) is not.

5. If we modify Example 3.1 by truncating the state space to say
$\tilde{\mathcal{X}} = \mathcal{X} \cap (\{1, \dots, M\} \times \{1, \dots, M\})$ for some $1 < M < \infty$, then the corresponding
adaptive Gibbs sampler is ergodic, and (7) holds.

Before we proceed with the proof of Theorem 4.2 we need some preliminary
lemmas, which may be of independent interest.

Lemma 4.4. Let $\beta \in \mathcal{Y}$ with $\mathcal{Y}$ as in (5). If RSG($\beta$) is uniformly ergodic, then
also RSG($\alpha$) is uniformly ergodic for every $\alpha \in \mathcal{Y}$. Moreover there exist $M < \infty$
and $\rho < 1$ s.t. $\sup_{x_0 \in \mathcal{X},\, \alpha \in \mathcal{Y}} T(x_0, \alpha, n) \le M\rho^n \to 0$.

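Proof. Let $P_\beta$ be the transition kernel of RSG($\beta$). It is well known that for
uniformly ergodic Markov chains the whole state space $\mathcal{X}$ is small (c.f. Theorem
5.2.1 and 5.2.4 in [33] with their $\psi = \pi$). Thus there exist $s > 0$, a probability
measure $\mu$ on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ and a positive integer $m$, s.t. for every $x \in \mathcal{X}$,

$$ P_\beta^m(x, \cdot) \ge s\mu(\cdot). \tag{8} $$

Now fix $\alpha \in \mathcal{Y}$ and let

$$ r := \min_i \frac{\alpha_i}{\beta_i}. $$

Since $\beta \in \mathcal{Y}$, we have $1 \ge r \ge \frac{\varepsilon}{1 - (d-1)\varepsilon} > 0$ and $P_\alpha$ can be written as a mixture
of transition kernels of two random scan Gibbs samplers, namely

$$ P_\alpha = rP_\beta + (1 - r)P_q, \quad \text{where } q = \frac{\alpha - r\beta}{1 - r}. $$

This combined with (8) implies

$$ P_\alpha^m(x, \cdot) \ge r^m P_\beta^m(x, \cdot) \ge r^m s\mu(\cdot)
   \ge \left(\frac{\varepsilon}{1 - (d-1)\varepsilon}\right)^m s\mu(\cdot) \quad \text{for every } x \in \mathcal{X}. \tag{9} $$

By Theorem 8 of [43] condition (9) implies

$$ \|P_\alpha^n(x, \cdot) - \pi(\cdot)\|_{TV} \le \left(1 - \left(\frac{\varepsilon}{1 - (d-1)\varepsilon}\right)^m s\right)^{\lfloor n/m \rfloor} \quad \text{for all } x \in \mathcal{X}. \tag{10} $$

Since the right hand side of (10) does not depend on $\alpha$, the claim follows. □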
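
The mixture step in the proof above is elementary and can be checked numerically. The sketch below computes $r = \min_i \alpha_i/\beta_i$ and $q = (\alpha - r\beta)/(1 - r)$ for given weight vectors; because $P_\alpha = \sum_i \alpha_i P_i$ is linear in $\alpha$, the identity $\alpha = r\beta + (1 - r)q$ with $q$ a probability vector is all that the kernel decomposition requires. The function name and the sample vectors are hypothetical choices of ours.

```python
def mixture_decomposition(alpha, beta):
    """Return (r, q) with alpha = r * beta + (1 - r) * q.

    r = min_i alpha_i / beta_i, as in the proof of Lemma 4.4.
    Requires alpha != beta, so that r < 1; entries of q may differ from
    their exact values by floating-point rounding.
    """
    r = min(a / b for a, b in zip(alpha, beta))
    q = [(a - r * b) / (1 - r) for a, b in zip(alpha, beta)]
    return r, q

# Illustrative values: d = 3, eps = 0.2, beta uniform.
eps = 0.2
alpha = [0.2, 0.3, 0.5]
beta = [1.0 / 3, 1.0 / 3, 1.0 / 3]
r, q = mixture_decomposition(alpha, beta)
lower_bound = eps / (1 - (len(alpha) - 1) * eps)   # eps / (1 - (d - 1) eps)
```

Here `r` stays above `lower_bound`, which is the quantity that makes the minorisation constant in (9) independent of $\alpha$.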

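Lemma 4.5. Let $P_\alpha$ and $P_{\alpha'}$ be random scan Gibbs samplers using selection
probabilities $\alpha, \alpha' \in \mathcal{Y} := [\varepsilon, 1 - (d-1)\varepsilon]^d$ for some $\varepsilon > 0$. Then

$$ \|P_\alpha(x, \cdot) - P_{\alpha'}(x, \cdot)\|_{TV} \le \frac{|\alpha - \alpha'|}{\varepsilon + |\alpha - \alpha'|} \le \frac{|\alpha - \alpha'|}{\varepsilon}. \tag{11} $$

Proof. Let $\delta := |\alpha - \alpha'|$. Then
$r := \min_i \frac{\alpha'_i}{\alpha_i} \ge \frac{\varepsilon}{\varepsilon + \max_i |\alpha_i - \alpha'_i|} \ge \frac{\varepsilon}{\varepsilon + \delta}$ and reasoning
as in the proof of Lemma 4.4 we can write $P_{\alpha'} = rP_\alpha + (1 - r)P_q$ for some
$q$ and compute

$$ \|P_\alpha(x, \cdot) - P_{\alpha'}(x, \cdot)\|_{TV}
   = \|(rP_\alpha + (1 - r)P_\alpha) - (rP_\alpha + (1 - r)P_q)\|_{TV}
   = (1 - r)\|P_\alpha - P_q\|_{TV} \le \frac{\delta}{\varepsilon + \delta}, $$

as claimed. □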
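
The bound (11) can be checked numerically on a small finite state space. The sketch below builds the single-coordinate Gibbs kernels $P_i$ for an arbitrary illustrative target, forms $P_\alpha = \sum_i \alpha_i P_i$, and compares the worst-case total variation distance with $\delta/(\varepsilon + \delta)$. The target masses, the two weight vectors, and the use of the maximum norm for $|\alpha - \alpha'|$ are assumptions made for the example (the maximum norm is the one appearing in the proof, and any larger norm only weakens the bound).

```python
def same_off(x, y, i):
    """True if states x and y agree in every coordinate except possibly i."""
    return all(a == b for k, (a, b) in enumerate(zip(x, y)) if k != i)

def gibbs_kernels(pi):
    """Return the Gibbs kernels P_1, ..., P_d as row-stochastic matrices."""
    states = sorted(pi)
    d = len(states[0])
    kernels = []
    for i in range(d):
        rows = []
        for x in states:
            norm = sum(pi[s] for s in states if same_off(x, s, i))
            rows.append([pi[y] / norm if same_off(x, y, i) else 0.0
                         for y in states])
        kernels.append(rows)
    return kernels

def mixture_kernel(kernels, alpha):
    """P_alpha = sum_i alpha_i P_i."""
    n = len(kernels[0])
    return [[sum(a * P[r][c] for a, P in zip(alpha, kernels))
             for c in range(n)] for r in range(n)]

def sup_tv(P, Q):
    """sup_x ||P(x, .) - Q(x, .)||_TV, with TV equal to half the L1 distance."""
    return max(0.5 * sum(abs(p - q) for p, q in zip(rp, rq))
               for rp, rq in zip(P, Q))

# Illustrative target on a 2 x 3 grid and two weight vectors in [eps, 1 - eps]^2.
pi = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 1.0,
      (1, 0): 3.0, (1, 1): 1.0, (1, 2): 4.0}
eps = 0.3
alpha, alpha_prime = [0.3, 0.7], [0.45, 0.55]
kernels = gibbs_kernels(pi)
lhs = sup_tv(mixture_kernel(kernels, alpha), mixture_kernel(kernels, alpha_prime))
delta = max(abs(a - b) for a, b in zip(alpha, alpha_prime))
bound = delta / (eps + delta)
```

In this two-coordinate example `lhs` cannot exceed `delta`, since $P_\alpha - P_{\alpha'} = (\alpha_1 - \alpha'_1)(P_1 - P_2)$, so the inequality `lhs <= bound` holds with room to spare.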

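Corollary 4.6. $P_\alpha(x, B)$ as a function of $\alpha$ on $\mathcal{Y}$ is Lipschitz with Lipschitz
constant $1/\varepsilon$ for every fixed set $B \in \mathcal{B}(\mathcal{X})$.

Corollary 4.7. If $|\alpha_n - \alpha_{n-1}| \to 0$ in probability, then also
$\sup_{x \in \mathcal{X}} \|P_{\alpha_n}(x, \cdot) - P_{\alpha_{n-1}}(x, \cdot)\|_{TV} \to 0$ in probability.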

Proof of Theorem 4.2. We conclude the result from Theorem 1 of [44], which
requires simultaneous uniform ergodicity and diminishing adaptation. Simultaneous
uniform ergodicity results from combining assumption (b) and Lemma 4.4.
Diminishing adaptation results from assumption (a) with Corollary 4.7. Moreover
note that Lemma 4.4 is uniform in $x_0$ and $\alpha_0$ and (a') yields uniformly
diminishing adaptation again by Corollary 4.7. A look into the proof of
Theorem 1 of [44] reveals that this suffices for the uniform part of Theorem 4.2. □

Finally, we note that verifying uniform ergodicity of a random scan Gibbs
sampler, as required by assumption (b) of Theorem 4.2, may not be
straightforward. Such issues have been investigated in e.g. [38] and more recently in
relation to the parametrization of hierarchical models (see [35] and references
therein). In the following proposition, we show that to verify uniform ergodicity
of any random scan Gibbs sampler, it suffices to verify uniform ergodicity of
the corresponding systematic scan Gibbs sampler (which updates the coordinates
$1, 2, \dots, d$ in sequence rather than selecting coordinates randomly). See also
Theorem 2 of [34] for a related result.

Proposition 4.8. Let $\alpha \in \mathcal{Y}$ with $\mathcal{Y}$ as in (5). If the systematic scan Gibbs
sampler is uniformly ergodic, then so is RSG($\alpha$).

Proof. Let

$$ P = P_1 P_2 \cdots P_d $$

be the transition kernel of the uniformly ergodic systematic scan Gibbs sampler,
where $P_i$ stands for the step that updates coordinate $i$. By the minorisation
condition characterisation, there exist $s > 0$, a probability measure $\mu$ on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$
and a positive integer $m$, s.t. for every $x \in \mathcal{X}$,

$$ P^m(x, \cdot) \ge s\mu(\cdot). $$

However, the probability that the random scan Gibbs sampler $P_{1/d}$ in its $md$
subsequent steps will update the coordinates in exactly the same order is
$(1/d)^{md} > 0$. Therefore the following minorisation condition holds for the random
scan Gibbs sampler.

$$ P_{1/d}^{md}(x, \cdot) \ge (1/d)^{md} s\mu(\cdot). $$

We conclude that RSG($1/d$) is uniformly ergodic, and then by Lemma 4.4 it
follows that RSG($\alpha$) is uniformly ergodic for any $\alpha \in \mathcal{Y}$. □
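
The key inequality in this proof, $P_{1/d}^{md} \ge (1/d)^{md} P^m$ entrywise, can be verified directly on a small example. The sketch below does so for $d = 2$ and $m = 2$ on an arbitrary four-state target; the target masses and helper names are hypothetical, and plain nested lists are used instead of a matrix library to keep the example self-contained.

```python
def same_off(x, y, i):
    """True if states x and y agree in every coordinate except possibly i."""
    return all(a == b for k, (a, b) in enumerate(zip(x, y)) if k != i)

def gibbs_kernels(pi):
    """Return the Gibbs kernels P_1, ..., P_d as row-stochastic matrices."""
    states = sorted(pi)
    d = len(states[0])
    return [[[pi[y] / sum(pi[s] for s in states if same_off(x, s, i))
              if same_off(x, y, i) else 0.0 for y in states]
             for x in states] for i in range(d)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matpow(A, k):
    """A multiplied by itself k times (k >= 1)."""
    out = A
    for _ in range(k - 1):
        out = matmul(out, A)
    return out

pi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
P1, P2 = gibbs_kernels(pi)
d, m = 2, 2
P_sys = matmul(P1, P2)                       # systematic scan P = P_1 P_2
P_unif = [[0.5 * a + 0.5 * b for a, b in zip(r1, r2)]
          for r1, r2 in zip(P1, P2)]         # random scan P_{1/d}
lhs = matpow(P_unif, m * d)                  # P_{1/d}^{md}
rhs = [[(1.0 / d) ** (m * d) * v for v in row]
       for row in matpow(P_sys, m)]          # (1/d)^{md} P^m
# Smallest entrywise gap; nonnegative (up to rounding) if the bound holds.
gap = min(a - b for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
```

The inequality holds because expanding $((P_1 + P_2)/2)^{4}$ gives sixteen nonnegative terms, one of which is $(1/2)^4 P_1 P_2 P_1 P_2$.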

4.2. Adaptive random scan Metropolis-within-Gibbs

In this section we consider random scan Metropolis-within-Gibbs sampler
algorithms (see also Section 5 for the nonuniform case). Thus, given $X_{n-1,-i}$,
the $i$-th coordinate $X_{n-1,i}$ is updated by a draw $Y$ from the proposal
distribution $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ with the usual Metropolis acceptance probability
for the marginal stationary distribution $\pi(\cdot|X_{n-1,-i})$. Here, we consider
Algorithm AdapRSMwG, where the proposal distributions $Q_{X_{n-1,-i}}(X_{n-1,i}, \cdot)$ remain
fixed, but the selection probabilities are adapted on the fly. We shall prove
ergodicity of such algorithms under some circumstances. (The more general
algorithm AdapRSadapMwG is then considered in the following section.)

To continue, let $P_{x_{-i}}$ denote the resulting Metropolis transition kernel for
obtaining $X_{n,i} | X_{n-1,i}$ given $X_{n-1,-i} = x_{-i}$. We shall require the following
assumption.

Assumption 4.9. For every $i \in \{1, \dots, d\}$ the transition kernel $P_{x_{-i}}$ is
uniformly ergodic for every $x_{-i} \in \mathcal{X}_{-i}$. Moreover there exist $s_i > 0$ and an
integer $m_i$ s.t. for every $x_{-i} \in \mathcal{X}_{-i}$ there exists a probability measure $\nu_{x_{-i}}$ on
$(\mathcal{X}_i, \mathcal{B}(\mathcal{X}_i))$, s.t.

$$ P_{x_{-i}}^{m_i}(x_i, \cdot) \ge s_i \nu_{x_{-i}}(\cdot) \quad \text{for every } x_i \in \mathcal{X}_i. $$

We have the following counterpart of Theorem 4.2.

Theorem 4.10. Let $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$ as in (5). Assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$.

(b) there exists $\beta \in \mathcal{Y}$ s.t. RSG($\beta$) is uniformly ergodic.

(c) Assumption 4.9 holds.

Then AdapRSMwG is ergodic, i.e.

$$ T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{12} $$

Moreover, if

(a') $\sup_{x_0, \alpha_0} |\alpha_n - \alpha_{n-1}| \to 0$ in probability,

then convergence of AdapRSMwG is also uniform over all $x_0, \alpha_0$, i.e.

$$ \sup_{x_0, \alpha_0} T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{13} $$



Remark 4.11. Remarks 4.3.1 to 4.3.3 still apply. Also, Assumption 4.9 can easily
be verified in some cases of interest, e.g.

1. Independence samplers are essentially uniformly ergodic if and only if the
candidate density is bounded below by a multiple of the stationary density,
i.e. $q(dx) \ge s\pi(dx)$ for some $s > 0$, c.f. [31].

2. The Metropolis-Hastings algorithm with continuous and positive proposal
density $q(\cdot, \cdot)$ and bounded target density $\pi$ is uniformly ergodic if the
state space is compact, c.f. [33, 43].



To prove Theorem 4.10 we build on the approach of [40]. In particular recall
the following notion of strong uniform ergodicity.

Definition 4.12. We say that a transition kernel $P$ on $\mathcal{X}$ with stationary
distribution $\pi$ is $(m, s)$-strongly uniformly ergodic, if for some $s > 0$ and positive
integer $m$

$$ P^m(x, \cdot) \ge s\pi(\cdot) \quad \text{for every } x \in \mathcal{X}. $$

Moreover, we will say that a family of Markov chains $\{P_\gamma\}_{\gamma \in \Gamma}$ on $\mathcal{X}$ with
stationary distribution $\pi$ is $(m, s)$-simultaneously strongly uniformly ergodic, if
for some $s > 0$ and positive integer $m$

$$ P_\gamma^m(x, \cdot) \ge s\pi(\cdot) \quad \text{for every } x \in \mathcal{X} \text{ and } \gamma \in \Gamma. $$

By Proposition 1 in [40], if a Markov chain is both uniformly ergodic and
reversible, then it is strongly uniformly ergodic. The following lemma improves
over this result by controlling both involved parameters.

Lemma 4.13. Let $\mu$ be a probability measure on $\mathcal{X}$, let $m$ be a positive integer
and let $s > 0$. If a reversible transition kernel $P$ satisfies the condition

$$ P^m(x, \cdot) \ge s\mu(\cdot) \quad \text{for every } x \in \mathcal{X}, $$

then it is $\left(\left(\left\lfloor \frac{\log(s/4)}{\log(1-s)} \right\rfloor + 2\right) m,\ \frac{s^2}{8}\right)$-strongly uniformly ergodic.

Proof. By Theorem 8 of [43] for every $A \in \mathcal{B}(\mathcal{X})$ we have

$$ |P^n(x, A) - \pi(A)| \le (1 - s)^{\lfloor n/m \rfloor}, $$

and in particular

$$ |P^{km}(x, A) - \pi(A)| \le s/4 \quad \text{for } k \ge \frac{\log(s/4)}{\log(1-s)}. \tag{14} $$

Since $\pi$ is stationary for $P$, we have $\pi(\cdot) \ge s\mu(\cdot)$ and thus an upper bound for
the Radon-Nikodym derivative

$$ d\mu/d\pi \le 1/s. \tag{15} $$

Moreover by reversibility

$$ \pi(dx)P^m(x, dy) = \pi(dy)P^m(y, dx) \ge \pi(dy)s\mu(dx) $$

and consequently

$$ P^m(x, dy) \ge s\,(\mu(dx)/\pi(dx))\,\pi(dy). \tag{16} $$

Now define

$$ A := \{x \in \mathcal{X} : \mu(dx)/\pi(dx) \ge 1/2\}. $$

Clearly $\mu(A^c) \le 1/2$. Therefore by (15) we have

$$ 1/2 \le \mu(A) \le (1/s)\pi(A) $$

and hence $\pi(A) \ge s/2$. Moreover (14) yields

$$ P^{km}(x, A) \ge s/4 \quad \text{for } k := \left\lfloor \frac{\log(s/4)}{\log(1-s)} \right\rfloor + 1. $$

And with $k$ defined above, by (16) we have

$$ P^{km+m}(x, \cdot) = \int_{\mathcal{X}} P^{km}(x, dz)P^m(z, \cdot) \ge \int_A P^{km}(x, dz)P^m(z, \cdot)
   \ge \int_A P^{km}(x, dz)\,(s/2)\,\pi(\cdot) \ge (s^2/8)\,\pi(\cdot). $$

This completes the proof. □

We will need the following generalization of Lemma 4.4.

Lemma 4.14. Let $\beta \in \mathcal{Y}$ with $\mathcal{Y}$ as in (5). If RSG($\beta$) is uniformly ergodic
then there exist $s' > 0$ and a positive integer $m'$ s.t. the family $\{\text{RSG}(\alpha)\}_{\alpha \in \mathcal{Y}}$ is
$(m', s')$-simultaneously strongly uniformly ergodic.

Proof. $P_\beta(x, \cdot)$ is uniformly ergodic and reversible, therefore by Proposition 1 in
[40] it is $(m, s_1)$-strongly uniformly ergodic for some $m$ and $s_1$. Therefore, and
arguing as in the proof of Lemma 4.4, c.f. (9), there exist
$s_2 \ge \left(\frac{\varepsilon}{1 - (d-1)\varepsilon}\right)^m$, s.t. for every $\alpha \in \mathcal{Y}$ and every $x \in \mathcal{X}$

$$ P_\alpha^m(x, \cdot) \ge s_2 P_\beta^m(x, \cdot) \ge s_1 s_2 \pi(\cdot). \tag{17} $$

Set $m' = m$ and $s' = s_1 s_2$. □
Proof of Theorem 4.10. We proceed as in the proof of Theorem 4.2, i.e.
establish diminishing adaptation and simultaneous uniform ergodicity and
conclude (12) and (13) from Theorem 1 of [44]. Observe that Lemma 4.5 applies
for random scan Metropolis-within-Gibbs algorithms exactly the same way as
for random scan Gibbs samplers. Thus diminishing adaptation results from
assumption (a) and Corollary 4.7. To establish simultaneous uniform ergodicity,
observe that by Assumption 4.9 and Lemma 4.13 the Metropolis transition
kernel for the $i$th coordinate, i.e. $P_{x_{-i}}$, has stationary distribution $\pi(\cdot|x_{-i})$ and is
$\left(\left(\left\lfloor \frac{\log(s_i/4)}{\log(1-s_i)} \right\rfloor + 2\right) m_i,\ \frac{s_i^2}{8}\right)$-strongly uniformly ergodic. Moreover by Lemma 4.14
the family RSG($\alpha$), $\alpha \in \mathcal{Y}$ is $(m', s')$-strongly uniformly ergodic, therefore by
Theorem 2 of [40] the family of random scan Metropolis-within-Gibbs samplers
with selection probabilities $\alpha \in \mathcal{Y}$, RSMwG($\alpha$), is $(m_*, s_*)$-simultaneously
strongly uniformly ergodic with $m_*$ and $s_*$ given as in [40]. □

We close this section with the following alternative version of Theorem 4.10.

Theorem 4.15. Let $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$ as in (5). Assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$.

(b) there exists $\beta \in \mathcal{Y}$ s.t. RSMwG($\beta$) is uniformly ergodic.

Then AdapRSMwG is ergodic, i.e.

$$ T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{18} $$

Moreover, if

(a') $\sup_{x_0, \alpha_0} |\alpha_n - \alpha_{n-1}| \to 0$ in probability,

then convergence of AdapRSMwG is also uniform over all $x_0, \alpha_0$, i.e.

$$ \sup_{x_0, \alpha_0} T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{19} $$

Proof. Diminishing adaptation results from assumption (a) and Corollary 4.7.
Simultaneous uniform ergodicity can be established as in the proof of Lemma 4.4.
The claim follows from Theorem 1 of [44]. □



Remark 4.16. Whereas the statement of Theorem 4.15 may be useful in specific
examples, typically condition (b), the uniform ergodicity of a random scan
Metropolis-within-Gibbs sampler, will not be available and establishing it will
involve conditions required by Theorem 4.10.



4.3. Adaptive random scan adaptive Metropolis-within-Gibbs

In this section, and also later in Section 5, we consider the adaptive random
scan adaptive Metropolis-within-Gibbs algorithm AdapRSadapMwG, that updates
both selection probabilities of the Gibbs kernel and proposal distributions of
the Metropolis step. Thus, given $X_{n-1,-i}$, the $i$-th coordinate $X_{n-1,i}$ is updated
by a draw $Y$ from a proposal distribution $Q_{X_{n-1,-i},\,\gamma_{n,i}}(X_{n-1,i}, \cdot)$ with
the usual acceptance probability. This doubly-adaptive algorithm has been used
by e.g. [12] for an application in statistical genetics. As with adaptive Metropo- 
lis algorithms, the adaption of the proposal distributions in this setting is 
motivated by optimal scaling results for random walk Metropolis algorithms 

[saiiiiisiiioiiiiiiiiisiiiiiiii. 

Let $P_{x_{-i},\,\gamma_{n,i}}$ denote the resulting Metropolis transition kernel for obtaining
$X_{n,i} | X_{n-1,i}$ given $X_{n-1,-i} = x_{-i}$. We will prove ergodicity of this generalised
algorithm using tools from the previous section. Assumption 4.9 must be
reformulated accordingly, as follows.

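Assumption 4.17. For every $i \in \{1, \dots, d\}$, $x_{-i} \in \mathcal{X}_{-i}$ and $\gamma_i \in \Gamma_i$, the
transition kernel $P_{x_{-i},\,\gamma_i}$ is uniformly ergodic. Moreover there exist $s_i > 0$ and
an integer $m_i$ s.t. for every $x_{-i} \in \mathcal{X}_{-i}$ and $\gamma_i \in \Gamma_i$ there exists a probability
measure $\nu_{x_{-i},\,\gamma_i}$ on $(\mathcal{X}_i, \mathcal{B}(\mathcal{X}_i))$, s.t.

$$ P_{x_{-i},\,\gamma_i}^{m_i}(x_i, \cdot) \ge s_i \nu_{x_{-i},\,\gamma_i}(\cdot) \quad \text{for every } x_i \in \mathcal{X}_i. $$

We have the following counterpart of Theorems 4.2 and 4.10.

Theorem 4.18. Let $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$ as in (5). Assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values $x_0 \in \mathcal{X}$, $\alpha_0 \in \mathcal{Y}$
and $\gamma_0 \in \Gamma$.

(b) there exists $\beta \in \mathcal{Y}$ s.t. RSG($\beta$) is uniformly ergodic.

(c) Assumption 4.17 holds.

(d) The Metropolis-within-Gibbs kernels exhibit diminishing adaptation, i.e.
for every $i \in \{1, \dots, d\}$ the $\mathcal{G}_{n+1}$ measurable random variable

$$ \sup_{x \in \mathcal{X}} \|P_{x_{-i},\,\gamma_{n+1,i}}(x_i, \cdot) - P_{x_{-i},\,\gamma_{n,i}}(x_i, \cdot)\|_{TV} \to 0 \quad \text{in probability, as } n \to \infty, $$

for fixed starting values $x_0 \in \mathcal{X}$, $\alpha_0 \in \mathcal{Y}$ and $\gamma_0$.

Then AdapRSadapMwG is ergodic, i.e.

$$ T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{20} $$

Moreover, if

(a') $\sup_{x_0, \alpha_0} |\alpha_n - \alpha_{n-1}| \to 0$ in probability,

(d') $\sup_{x_0, \alpha_0} \sup_{x \in \mathcal{X}} \|P_{x_{-i},\,\gamma_{n+1,i}}(x_i, \cdot) - P_{x_{-i},\,\gamma_{n,i}}(x_i, \cdot)\|_{TV} \to 0$ in probability,

then convergence of AdapRSadapMwG is also uniform over all $x_0, \alpha_0$, i.e.

$$ \sup_{x_0, \alpha_0} T(x_0, \alpha_0, n) \to 0 \quad \text{as } n \to \infty. \tag{21} $$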

Remark 4.19. Remarks 4.3(1)-(3) still apply. Moreover, Remark 4.11 applies for
verifying Assumption 4.17. Verifying condition (d) is discussed after the proof.

Proof. We again proceed by establishing diminishing adaptation and simultaneous
uniform ergodicity and concluding the result from Theorem 1 of [44].
To establish simultaneous uniform ergodicity we proceed as in the proof of
Theorem 4.10. Observe that by Assumption 4.17 and Lemma 4.13 every adaptive
Metropolis transition kernel for the $i$th coordinate, i.e. $P_{x_{-i},\gamma_i}$,
has stationary distribution $\pi(\cdot \mid x_{-i})$ and is strongly uniformly
ergodic, with constants determined by $m_i$ and $s_i$ (see Lemma 4.13).
Moreover, by Lemma 4.14 the family RSG($\alpha$), $\alpha \in \mathcal{Y}$, is
$(m', s')$-strongly uniformly ergodic. Therefore, by Theorem 2 of [40], the
family of random scan Metropolis-within-Gibbs samplers with selection
probabilities $\alpha \in \mathcal{Y}$ and proposals indexed by
$\gamma \in \Gamma$ is $(m_*, s_*)$-simultaneously strongly uniformly ergodic,
with $m_*$ and $s_*$ given as in [40].

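For diminishing adaptation we write

\[
\sup_{x \in \mathcal{X}} \| P_{\alpha_n,\gamma_n}(x,\cdot) - P_{\alpha_{n-1},\gamma_{n-1}}(x,\cdot) \|_{TV}
\le \sup_{x \in \mathcal{X}} \| P_{\alpha_n,\gamma_n}(x,\cdot) - P_{\alpha_{n-1},\gamma_n}(x,\cdot) \|_{TV}
+ \sup_{x \in \mathcal{X}} \| P_{\alpha_{n-1},\gamma_n}(x,\cdot) - P_{\alpha_{n-1},\gamma_{n-1}}(x,\cdot) \|_{TV}.
\]

The first term above converges to 0 in probability by Corollary 4.7 and
assumption (a). The second term,

\[ \sup_{x \in \mathcal{X}} \| P_{\alpha_{n-1},\gamma_n}(x,\cdot) - P_{\alpha_{n-1},\gamma_{n-1}}(x,\cdot) \|_{TV}, \]

converges to 0 in probability as a mixture of terms that converge to 0 in
probability by assumption (d). □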



The following lemma can be used to verify assumption (d) of Theorem 4.18;
see also Example 4.21 below.

Lemma 4.20. Assume that the adaptive proposals exhibit diminishing adaptation,
i.e. for every $i \in \{1,\dots,d\}$ the $\mathcal{G}_{n+1}$-measurable random variable

\[ \sup_{x \in \mathcal{X}} \| Q_{x_{-i},\gamma_{n+1,i}}(x_i,\cdot) - Q_{x_{-i},\gamma_{n,i}}(x_i,\cdot) \|_{TV} \to 0 \quad \text{in probability, as } n \to \infty, \]

for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$.
Then any of the following conditions

(i) the Metropolis proposals have symmetric densities, i.e.

\[ q_{x_{-i},\gamma_{n,i}}(x_i,y_i) = q_{x_{-i},\gamma_{n,i}}(y_i,x_i), \]

(ii) $\mathcal{X}_i$ is compact for every $i$, and $\pi$ is continuous, everywhere positive and bounded,

implies condition (d) of Theorem 4.18.
Proof. Condition (i) implies condition (d) of Theorem 4.18 as a consequence of
Proposition 12.3 of [1]. For the second statement note that condition (ii)
implies there exists $K < \infty$ s.t. $\pi(y)/\pi(x) \le K$ for every
$x, y \in \mathcal{X}$. To conclude that (d) results from (ii) note that

\[ |\min\{a,b\} - \min\{c,d\}| \le |a-c| + |b-d| \tag{22} \]

and recall the acceptance probabilities
$\alpha_i(x,y) = \min\left\{1, \frac{\pi(y)q_i(y,x)}{\pi(x)q_i(x,y)}\right\}$.
Indeed, for any $x \in \mathcal{X}$ and $A \in \mathcal{B}(\mathcal{X})$, using (22) we have

\[
\begin{aligned}
|P_1(x,A) - P_2(x,A)|
&\le \left| \int_A \left( \min\left\{ q_1(x,y), \tfrac{\pi(y)}{\pi(x)} q_1(y,x) \right\}
 - \min\left\{ q_2(x,y), \tfrac{\pi(y)}{\pi(x)} q_2(y,x) \right\} \right) dy \right| \\
&\quad + \mathbb{1}_{\{x \in A\}} \left| \int_{\mathcal{X}} \Big( (1-\alpha_1(x,y))\,q_1(x,y) - (1-\alpha_2(x,y))\,q_2(x,y) \Big)\, dy \right| \\
&\le 4(K+1)\, \| Q_1(x,\cdot) - Q_2(x,\cdot) \|_{TV}.
\end{aligned}
\]

The claim follows since a random scan Metropolis-within-Gibbs sampler is
a mixture of Metropolis samplers. □
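Inequality (22) is stated without proof. One possible justification, added here as a sketch rather than taken from the source, writes the minimum through the absolute value and applies the reverse triangle inequality $\big||u|-|v|\big| \le |u-v|$ with $u = a-b$ and $v = c-d$:

```latex
\begin{align*}
\min\{a,b\} &= \tfrac12\bigl(a+b-|a-b|\bigr), \\
\bigl|\min\{a,b\}-\min\{c,d\}\bigr|
 &= \tfrac12\Bigl|(a-c)+(b-d)-\bigl(|a-b|-|c-d|\bigr)\Bigr| \\
 &\le \tfrac12\bigl(|a-c|+|b-d|\bigr)+\tfrac12\Bigl||a-b|-|c-d|\Bigr| \\
 &\le \tfrac12\bigl(|a-c|+|b-d|\bigr)+\tfrac12\bigl|(a-c)-(b-d)\bigr| \\
 &\le |a-c|+|b-d| .
\end{align*}
```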

We now provide an example to show that diminishing adaptation of proposals
as in Lemma 4.20 does not necessarily imply condition (d) of Theorem 4.18, so
some additional assumption is required, e.g. (i) or (ii) of Lemma 4.20.

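Example 4.21. Consider a sequence of Metropolis algorithms with transition
kernels $P_1, P_2, \dots$ designed for sampling from $\pi(k) = p^k(1-p)$ on
$\mathcal{X} = \{0, 1, \dots\}$. The transition kernel $P_n$ results from using
the proposal kernel $Q_n$ and the standard acceptance rule, where

\[
Q_n(j,k) = q_n(k) :=
\begin{cases}
p^{k} \left( \frac{1}{1-p} - p^{n} + p^{2n} \right)^{-1} & \text{for } k \ne n, \\[4pt]
p^{2n} \left( \frac{1}{1-p} - p^{n} + p^{2n} \right)^{-1} & \text{for } k = n.
\end{cases}
\]

Clearly

\[ \sup_{j \in \mathcal{X}} \| Q_{n+1}(j,\cdot) - Q_n(j,\cdot) \|_{TV} = q_{n+1}(n) - q_n(n) \to 0. \]

However

\[
\begin{aligned}
\sup_{j \in \mathcal{X}} \| P_{n+1}(j,\cdot) - P_n(j,\cdot) \|_{TV}
&\ge P_{n+1}(n,0) - P_n(n,0) \\
&= \min\left\{ q_{n+1}(0), \tfrac{\pi(0)}{\pi(n)} q_{n+1}(n) \right\}
 - \min\left\{ q_n(0), \tfrac{\pi(0)}{\pi(n)} q_n(n) \right\} \\
&= q_{n+1}(0) - q_n(0)\,p^{n} \to 1 - p \ne 0.
\end{aligned}
\]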
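The two limits in Example 4.21 can be checked numerically. The sketch below is illustrative only: the function names, the truncation level `k_max` and the choice $p = 0.5$ are assumptions made here, not part of the source. It evaluates the proposal masses $q_n(k)$, the total variation distance between consecutive proposals (on a truncated state space), and the lower bound $P_{n+1}(n,0) - P_n(n,0)$.

```python
def normaliser(n, p):
    # Z_n = sum_{k != n} p^k + p^(2n) = 1/(1-p) - p^n + p^(2n)
    return 1.0 / (1.0 - p) - p**n + p**(2 * n)

def q(n, k, p):
    """Proposal mass q_n(k) of Example 4.21."""
    weight = p**(2 * n) if k == n else p**k
    return weight / normaliser(n, p)

def proposal_tv(n, p, k_max=400):
    """TV distance between q_{n+1} and q_n, truncating the state space
    at k_max (hypothetical cut-off; the neglected tail is of order p^k_max)."""
    return 0.5 * sum(abs(q(n + 1, k, p) - q(n, k, p)) for k in range(k_max))

def kernel_gap(n, p):
    """P_{n+1}(n, 0) - P_n(n, 0), using P_m(n, 0) = min{q_m(0), q_m(n) / p^n}
    because pi(0) / pi(n) = p^(-n) for pi(k) = p^k (1 - p)."""
    move_next = min(q(n + 1, 0, p), q(n + 1, n, p) / p**n)
    move_curr = min(q(n, 0, p), q(n, n, p) / p**n)
    return move_next - move_curr

if __name__ == "__main__":
    p = 0.5
    for n in (5, 10, 20):
        print(n, proposal_tv(n, p), kernel_gap(n, p))
```

As $n$ grows the first printed column of distances shrinks while the second stays close to $1 - p$, which is the behaviour the example describes.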
5. Ergodicity - nonuniform case

In this section we consider the case where nonadaptive kernels are not
necessarily uniformly ergodic. We study adaptive random scan Gibbs adaptive
Metropolis-within-Gibbs (AdapRSadapMwG) algorithms in the nonuniform setting,
with parameters $\alpha \in \mathcal{Y}$ and $\gamma_i \in \Gamma_i$,
$i = 1,\dots,d$, subject to adaptation. The conclusions we draw apply
immediately to adaptive random scan Gibbs Metropolis-within-Gibbs (AdapRSMwG)
algorithms by keeping the parameters $\gamma_i$ fixed for the
Metropolis-within-Gibbs steps.

We keep the assumption that selection probabilities are in $\mathcal{Y}$
defined in (5), whereas the uniform ergodicity assumption will be replaced by
some natural regularity conditions on the target density.

Our strategy is to use the generic approach of [44] and to verify the
diminishing adaptation and the containment conditions. The containment
condition has been extensively studied in [8] and it is essentially necessary
for ergodicity of adaptive chains; see Theorem 2 therein for the precise
result. In particular, containment is implied by simultaneous geometric
ergodicity for the adaptive kernels. More precisely, we shall use the
following result of [8].

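Theorem 5.1 (Corollary 2 of [8]). Consider the family
$\{P_\gamma : \gamma \in \Gamma\}$ of Markov chains on
$\mathcal{X} \subseteq \mathbb{R}^d$, satisfying the following conditions:

(i) for any compact set $C \in \mathcal{B}(\mathcal{X})$, there exist some
integer $m > 0$, and real $\rho > 0$, and a probability measure $\nu_\gamma$ on
$C$ s.t.

\[ P^{m}_\gamma(x,\cdot) \ge \rho\,\nu_\gamma(\cdot) \quad \text{for all } x \in C, \]

(ii) there exists a function $V : \mathcal{X} \to [1,\infty)$, s.t. for any
compact set $C \in \mathcal{B}(\mathcal{X})$, we have
$\sup_{x \in C} V(x) < \infty$, $\pi(V) < \infty$, and

\[ \limsup_{|x| \to \infty} \sup_{\gamma \in \Gamma} \frac{P_\gamma V(x)}{V(x)} < 1. \]

Then for any adaptive strategy using $\{P_\gamma : \gamma \in \Gamma\}$,
containment holds.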

Throughout this section we assume $\mathcal{X}_i = \mathbb{R}$ for
$i = 1,\dots,d$, and $\mathcal{X} = \mathbb{R}^d$, and let $\mu_k$ denote the
Lebesgue measure on $\mathbb{R}^k$. By $\{e_1,\dots,e_d\}$ denote the coordinate
unit vectors and let $|\cdot|$ be the Euclidean norm.

Our focus is on random walk Metropolis proposals with symmetric densities
for updating $X_i \mid X_{-i}$, denoted $q_{i,\gamma_i}(\cdot)$,
$\gamma_i \in \Gamma_i$. We shall work in the following setting, extensively
studied for nonadaptive Metropolis-within-Gibbs algorithms in [17]; see also
[30] for related work and [33] for analysis of the random walk Metropolis
algorithm.

Assumption 5.2. The target distribution $\pi$ is absolutely continuous with
respect to $\mu_d$ with strictly positive and continuous density $\pi(\cdot)$
on $\mathcal{X}$.

Assumption 5.3. The family
$\{q_{i,\gamma_i}\}_{1 \le i \le d;\ \gamma_i \in \Gamma_i}$ of symmetric
proposal densities with respect to $\mu_1$ (one-dimensional Lebesgue measure)
is such that there exist constants $\eta_i > 0$, $\delta_i > 0$, for
$i = 1,\dots,d$, s.t.

\[ \inf_{|x| \le \delta_i} q_{i,\gamma_i}(x) \ge \eta_i \quad \text{for every } 1 \le i \le d \text{ and } \gamma_i \in \Gamma_i. \tag{23} \]

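Assumption 5.4. There exist $0 < \delta < \Delta \le \infty$, such that

\[ \xi := \inf_{1 \le i \le d,\ \gamma_i \in \Gamma_i} \int_\delta^\Delta q_{i,\gamma_i}(y)\,\mu_1(dy) > 0, \tag{24} \]

and, for any sequence $x := \{x^j\}$ with $\lim_{j \to \infty} |x^j| = +\infty$,
there exists a subsequence $\tilde{x} := \{\tilde{x}^j\}$ s.t. for some
$i \in \{1,\dots,d\}$ and all $y \in [\delta, \Delta]$,

\[
\lim_{j \to \infty} \frac{\pi(\tilde{x}^j)}{\pi(\tilde{x}^j - \mathrm{sign}(\tilde{x}^j_i)\,y\,e_i)} = 0
\quad \text{and} \quad
\lim_{j \to \infty} \frac{\pi(\tilde{x}^j + \mathrm{sign}(\tilde{x}^j_i)\,y\,e_i)}{\pi(\tilde{x}^j)} = 0. \tag{25}
\]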



A discussion of the seemingly involved Assumption 5.4 and simple criteria for
checking it are given in [17]. It was shown in [17] that under these
assumptions nonadaptive random scan Metropolis-within-Gibbs algorithms are
geometrically ergodic for subexponential densities. We establish ergodicity of
the doubly adaptive AdapRSadapMwG algorithm in the same setting.

Theorem 5.5. Let $\pi$ be a subexponential density and let the selection
probabilities $\alpha_n \in \mathcal{Y}$ for all $n$, with $\mathcal{Y}$ as in
(5). Moreover assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values
$x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$, $\gamma_i \in \Gamma_i$,
$i = 1,\dots,d$;

(b) the Metropolis-within-Gibbs kernels exhibit diminishing adaptation, i.e.
for every $i \in \{1,\dots,d\}$ the $\mathcal{G}_{n+1}$-measurable random variable

\[ \sup_{x \in \mathcal{X}} \| P_{x_{-i},\gamma_{n+1,i}}(x_i,\cdot) - P_{x_{-i},\gamma_{n,i}}(x_i,\cdot) \|_{TV} \to 0 \quad \text{in probability, as } n \to \infty, \]

for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$,
$\gamma_i \in \Gamma_i$, $i = 1,\dots,d$;

(c) Assumptions 5.2, 5.3 and 5.4 hold.

Then AdapRSadapMwG is ergodic, i.e.

\[ T(x_0,\alpha_0,\gamma_0,n) \to 0 \quad \text{as } n \to \infty. \tag{26} \]

Before proving this result we state its counterpart for densities that are
log-concave in the tails. This is another typical setting carefully studied in
the context of geometric ergodicity of nonadaptive chains ([17], [40], [3T]),
where Assumption 5.4 is replaced by the following two conditions.

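Assumption 5.6. There exist $\beta > 0$ and $\delta$ s.t.
$1/\beta \le \delta < \Delta \le \infty$ and, for any sequence $x := \{x^j\}$
with $\lim_{j \to \infty} |x^j| = +\infty$, there exists a subsequence
$\tilde{x} := \{\tilde{x}^j\}$ s.t. for some $i \in \{1,\dots,d\}$ and for all
$y \in [\delta, \Delta]$,

\[
\lim_{j \to \infty} \frac{\pi(\tilde{x}^j)}{\pi(\tilde{x}^j - \mathrm{sign}(\tilde{x}^j_i)\,y\,e_i)} \le \exp\{-\beta y\}
\quad \text{and} \quad
\lim_{j \to \infty} \frac{\pi(\tilde{x}^j + \mathrm{sign}(\tilde{x}^j_i)\,y\,e_i)}{\pi(\tilde{x}^j)} \le \exp\{-\beta y\}. \tag{27}
\]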
Assumption 5.7.

i<i<d,',i£ri Jg e<p[e-l) 

Remark 5.8. As remarked in [17], Assumption 5.6 generalizes the one-dimensional
definition of log-concavity in the tails, and Assumption 5.7 is easy to ensure,
at least if $\Delta = \infty$, by taking the proposal distribution to be a
mixture of an adaptive component and a uniform on $[-U, U]$ for $U$ large
enough, or a mean zero Gaussian with large enough variance.

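Theorem 5.9. Let the selection probabilities $\alpha_n \in \mathcal{Y}$ for all
$n$, with $\mathcal{Y}$ as in (5). Moreover assume that

(a) $|\alpha_n - \alpha_{n-1}| \to 0$ in probability for fixed starting values
$x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$, $\gamma_i \in \Gamma_i$,
$i = 1,\dots,d$;

(b) the Metropolis-within-Gibbs kernels exhibit diminishing adaptation, i.e.
for every $i \in \{1,\dots,d\}$ the $\mathcal{G}_{n+1}$-measurable random variable

\[ \sup_{x \in \mathcal{X}} \| P_{x_{-i},\gamma_{n+1,i}}(x_i,\cdot) - P_{x_{-i},\gamma_{n,i}}(x_i,\cdot) \|_{TV} \to 0 \quad \text{in probability, as } n \to \infty, \]

for fixed starting values $x_0 \in \mathcal{X}$ and $\alpha_0 \in \mathcal{Y}$,
$\gamma_i \in \Gamma_i$, $i = 1,\dots,d$;

(c) Assumptions 5.2, 5.3, 5.6 and 5.7 hold.

Then AdapRSadapMwG is ergodic, i.e.

\[ T(x_0,\alpha_0,\gamma_0,n) \to 0 \quad \text{as } n \to \infty. \tag{28} \]

We now proceed to proofs.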
We now proceed to proofs. 

Proof of Theorem 5.5. Ergodicity will follow from Theorem 2 of [44] by
establishing the diminishing adaptation and containment conditions. Diminishing
adaptation can be verified as in the proof of Theorem 4.18. Containment will
result from Theorem 5.1.

Recall that $P_{\alpha,\gamma}$ is the random scan Metropolis-within-Gibbs
kernel with selection probabilities $\alpha$ and proposals indexed by
$\{\gamma_i\}_{1 \le i \le d}$. To verify the small set condition (i), observe
that Assumptions 5.2 and 5.3 imply that for every compact set $C$ and every
vector $\gamma_i \in \Gamma_i$, $i \in \{1,\dots,d\}$, we can find $m^*$ and
$\rho^*$ independent of $\{\gamma_i\}$, and such that
$P^{m^*}_{1/d,\gamma}(x,\cdot) \ge \rho^*\nu(\cdot)$ for all $x \in C$. Hence,
arguing as in the proof of Lemma 4.4, there exist $m$ and $\rho$, independent
of $\{\gamma_i\}$, such that $P^{m}_{\alpha,\gamma}(x,\cdot) \ge \rho\,\nu(\cdot)$
for all $x \in C$.

To establish the drift condition (ii), let $V_s := \pi(x)^{-s}$ for some
$s \in (0,1)$ to be specified later. Then by Proposition 3 of [40], for all
$1 \le i \le d$, $\gamma_i \in \Gamma_i$, and $x \in \mathbb{R}^d$ we have

\[ P_{i,\gamma_i} V_s(x) \le r(s)\,V_s(x) \quad \text{where} \quad r(s) := 1 + s(1-s)^{1/s - 1}. \tag{29} \]

Since $r(s) \to 1$ as $s \to 0$, we can choose $s$ small enough, so that
The rest of the argument follows the proof of Theorem 2 in [17]; we just need
to make sure it is independent of $\alpha$ and $\gamma$. Assume by
contradiction that there exists an $\mathbb{R}^d$-valued sequence $\{x^j\}$
s.t. $\limsup_{j \to \infty} P_{\alpha,\gamma} V_s(x^j)/V_s(x^j) \ge 1$.
Then there exists a subsequence $\{\hat{x}^j\}$ such that
$\lim_{j \to \infty} P_{\alpha,\gamma} V_s(\hat{x}^j)/V_s(\hat{x}^j) \ge 1$.
Moreover, as shown in [17], there exists an integer $k \in \{1,\dots,d\}$ and a
further subsequence $\{\tilde{x}^j\}$, independent of $\gamma_i$, and such that

lim Pk.^^VsHn/Vsiin < r(s) - (2r(s) - (30) 



The contradiction follows from (29) and (30), since
Pc^Vsii^) ,. ^ P^,J,Vs{S:') 

^ ^ Z— 1 ^ ^ 

= lim (a,P,,^^T4(i^)/T4(i^) + ^«,%^'^ 
< e(r(s) - (2r(s) - + (1 -£)r(s) < 1. 

□ 

Proof of Theorem 5.9. The proof is along the same lines as the proof of
Theorem 5.5 and the proof of Theorem 3 of [17] and is omitted. □



Example 5.10. We now give an example involving a simple generalised linear
mixed model. Consider the model and prior given by

\[ Y_i \sim \mathrm{Pois}\left(e^{\theta + X_i}\right) \tag{31} \]
\[ X_i \sim N(0,1) \tag{32} \]
\[ \theta \sim N(0,1) \tag{33} \]

The model is chosen to be extremely simple so as not to detract from the
argument used to demonstrate ergodicity of AdapRSadapMwG, although this
argument readily generalises to different exponential families, link functions
and random effect distributions.

We consider simulating from the posterior distribution of $\theta, X$ given
observations $y_1,\dots,y_n$ using AdapRSadapMwG. More specifically we set

\[ q_{x_{-i},\gamma}(x_i,y_i) = \frac{\exp\{-(y_i - x_i)^2/(2\gamma)\}}{\sqrt{2\pi\gamma}}, \tag{34} \]

where the range of permissible scales $\gamma$ is restricted to be in some
range $\mathcal{R} = [a,b]$ with $0 < a \le b < \infty$. We are in the
subexponential tail case and specifically we have the following.

Proposition 5.11. Consider AdapRSadapMwG applied to model (31) using any
adaptive scheme satisfying the conditions (a) and (b) of Theorem 5.5. Then the
scheme is ergodic.
For the proof, we require the following definition from [17]. We let

\[ \Phi := \{\text{functions } \phi : \mathbb{R}^+ \to \mathbb{R}^+;\ \phi(x) \to \infty \text{ as } x \to \infty\}. \]

Proof of Proposition 5.11. According to Theorem 5.5 it remains to check that
Assumptions 5.2, 5.3 and 5.4 hold. Assumptions 5.2 and 5.3 hold by
construction, while Assumption 5.4 consists of two separate conditions. One of
these, given in (24), holds by construction from (34). Moreover, [17] shows
that (25) can be replaced by the following condition: there exist functions
$\{\phi_i \in \Phi,\ 1 \le i \le d\}$ such that for all $i \in \{1,\dots,d\}$
and all $y \in [\delta, \Delta]$,

\[
\lim_{|x_i| \to \infty} \sup_{\{x_{-i};\ \phi_j(|x_j|) \le \phi_i(|x_i|),\ j \ne i\}} \frac{\pi(x)}{\pi(x - \mathrm{sign}(x_i)\,y\,e_i)} = 0
\quad \text{and} \quad
\lim_{|x_i| \to \infty} \sup_{\{x_{-i};\ \phi_j(|x_j|) \le \phi_i(|x_i|),\ j \ne i\}} \frac{\pi(x + \mathrm{sign}(x_i)\,y\,e_i)}{\pi(x)} = 0. \tag{36}
\]

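Now take $\phi_i(x) = x$ for all $1 \le i \le d$ so that (36) can be rewritten
as the two conditions

\[ \lim_{|x_i| \to \infty} \sup_{\{x_{-i};\ |x_j| \le |x_i|,\ j \ne i\}} \exp\left\{ \int_{-y}^{0} \nabla_i \log \pi(x + \mathrm{sign}(x_i)\,z\,e_i)\,dz \right\} = 0 \tag{37} \]

\[ \lim_{|x_i| \to \infty} \sup_{\{x_{-i};\ |x_j| \le |x_i|,\ j \ne i\}} \exp\left\{ \int_{0}^{y} \nabla_i \log \pi(x + \mathrm{sign}(x_i)\,z\,e_i)\,dz \right\} = 0 \tag{38} \]

for all $y \in [\delta, \Delta]$, where $\nabla_i$ denotes the derivative in the
$i$th direction. We shall show that uniformly on the set $S_i(x_i)$, which is
defined to be $\{x_{-i};\ |x_j| \le |x_i|,\ j \ne i\}$, the function
$\nabla_i \log \pi(x)$ converges to $-\infty$ as $x_i \to +\infty$ and to
$+\infty$ as $x_i$ approaches $-\infty$.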

Now we have $d = n+1$ and let $i$ correspond to the component $x_i$ for
$1 \le i \le n$, with $n+1$ denoting the component $\theta$. Therefore for
$1 \le i \le n$

\[ \nabla_i \log \pi(x) = -e^{\theta + x_i} + y_i - x_i \]

and

\[ \nabla_{n+1} \log \pi(x) = -\sum_{i=1}^{n} e^{\theta + x_i} + \sum_{i=1}^{n} y_i - \theta. \]

Now for $x_i > 0$, $1 \le i \le n$,

\[ \nabla_i \log \pi(x) \le y_i - x_i, \]

which is diverging to $-\infty$ independently of the remaining coordinates.
Similarly,

\[ \nabla_{n+1} \log \pi(x) \le \sum_{i=1}^{n} y_i - \theta, \]

diverging to $-\infty$ independently of $\{x_i;\ 1 \le i \le n\}$.
For $x_i < 0$, $1 \le i \le n$ and $(x_{-i}, \theta) \in S_i(x_i)$, we have
$e^{\theta + x_i} \le 1$ and hence

\[ \nabla_i \log \pi(x) \ge y_i - x_i - 1, \]

again diverging to $+\infty$ uniformly. Finally for $\theta < 0$ and
$x \in S_{n+1}(\theta)$,

\[ \nabla_{n+1} \log \pi(x) \ge -n + \sum_{i=1}^{n} y_i - \theta, \]

again demonstrating the required uniform convergence. Thus ergodicity holds.

□ 
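The partial derivatives used in the proof of Proposition 5.11 can be sanity-checked numerically. The following is a minimal sketch, assuming the unnormalised log posterior implied by (31)-(33); the function names and the finite-difference helper are hypothetical and introduced only for illustration.

```python
import math

def log_post(theta, x, y):
    """Unnormalised log posterior of (theta, x) under (31)-(33):
    Poisson log-likelihood with rate exp(theta + x_i), N(0, 1) priors."""
    loglik = sum(yi * (theta + xi) - math.exp(theta + xi)
                 for xi, yi in zip(x, y))
    return loglik - 0.5 * sum(xi * xi for xi in x) - 0.5 * theta * theta

def grad_x(i, theta, x, y):
    """Partial derivative of log_post with respect to x_i."""
    return -math.exp(theta + x[i]) + y[i] - x[i]

def grad_theta(theta, x, y):
    """Partial derivative of log_post with respect to theta."""
    return -sum(math.exp(theta + xi) for xi in x) + sum(y) - theta

def central_difference(f, h=1e-6):
    """Symmetric finite-difference derivative of h -> f(h) at h = 0."""
    return (f(h) - f(-h)) / (2.0 * h)

if __name__ == "__main__":
    y, theta, x = [3, 0, 1], 0.3, [0.5, -1.2, 2.0]
    for i in range(len(x)):
        shifted = lambda h, i=i: log_post(theta, x[:i] + [x[i] + h] + x[i + 1:], y)
        print(i, central_difference(shifted), grad_x(i, theta, x, y))
    print("theta", central_difference(lambda h: log_post(theta + h, x, y)),
          grad_theta(theta, x, y))
```

The same functions can be used to spot-check the one-sided bounds: for $x_i > 0$ the value of `grad_x` never exceeds $y_i - x_i$, and for $x_i < 0$ with $|\theta| \le |x_i|$ it is at least $y_i - x_i - 1$.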

Remark 5.12. The random effect distribution in Example 5.10 can be altered
to give different results. For instance, if the distribution is doubly
exponential, Theorem 4.2 can be applied using very similar arguments to those
used above. Extensions to more complex hierarchical models are clearly
possible, though we do not pursue this here.

Remark 5.13. An important problem that we have not focused on involves the
construction of explicit adaptive strategies. Since little is known about the
optimisation of the Random Scan Random Walk Metropolis even in the
non-adaptive case, this is not a straightforward question. Further work we are
engaged in is exploring adaptation to attempt to maximise a given optimality
criterion for the chosen class of samplers. Two possible strategies are

• to scale the proposal variance to approach 2.4 times the empirically
observed conditional variance;

• to scale the proposal variance to achieve an algorithm with acceptance
proportion approximately 0.44.

Both these methods are founded in theoretical arguments, see for instance [42].
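As an illustration of the second strategy, the sketch below runs a Metropolis-within-Gibbs sampler on the model of Example 5.10 and tunes each proposal variance towards an acceptance rate of 0.44. It is a sketch under stated assumptions, not the authors' scheme: the Robbins-Monro update on the log scale, the step size exponent 0.6, the default range $[a, b] = [0.01, 100]$ and all function names are choices made here. The selection probabilities are kept uniform (not adapted), so condition (a) of Theorem 5.5 holds trivially; the decaying step size and the clipping of $\gamma_i$ to $[a, b]$ are intended to give diminishing adaptation of the Gaussian proposals (34).

```python
import math
import random

def log_post(state, y):
    """Unnormalised log posterior; state = [x_1, ..., x_n, theta]."""
    theta = state[-1]
    loglik = sum(yi * (theta + xi) - math.exp(theta + xi)
                 for xi, yi in zip(state[:-1], y))
    return loglik - 0.5 * sum(v * v for v in state)

def adaptive_mwg(y, n_iter=5000, a=0.01, b=100.0, target=0.44, seed=0):
    """Random scan MwG with adaptive Gaussian proposal variances in [a, b]."""
    rng = random.Random(seed)
    d = len(y) + 1                 # coordinates x_1..x_n and theta
    state = [0.0] * d
    gamma = [1.0] * d              # proposal variances, one per coordinate
    visits = [0] * d               # number of updates of each coordinate
    current = log_post(state, y)
    for _ in range(n_iter):
        i = rng.randrange(d)       # uniform, non-adaptive coordinate selection
        proposal = list(state)
        proposal[i] += rng.gauss(0.0, math.sqrt(gamma[i]))
        candidate = log_post(proposal, y)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            state, current = proposal, candidate
        # Robbins-Monro step on log(gamma_i); the step size decays with the
        # number of visits so that successive kernels differ less and less.
        visits[i] += 1
        step = visits[i] ** -0.6
        log_gamma = math.log(gamma[i]) + step * (accept_prob - target)
        gamma[i] = min(b, max(a, math.exp(log_gamma)))
    return state, gamma

if __name__ == "__main__":
    final_state, scales = adaptive_mwg([1, 0, 2], n_iter=2000)
    print(final_state, scales)
```

The first strategy could be sketched in the same loop by replacing the Robbins-Monro step with a running estimate of each coordinate's empirical variance, again clipped to $[a, b]$.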
6. Proof of Proposition 3.2

The analysis of Example 3.1 is somewhat delicate since the process is both time
and space inhomogeneous (as are most nontrivial adaptive MCMC algorithms).
To establish Proposition 3.2 we will define a couple of auxiliary stochastic
processes. Consider the following one-dimensional process
$(\bar{X}_n)_{n \ge 0}$ obtained from $(X_n)_{n \ge 0}$ by

\[ \bar{X}_n := X_{n,1} + X_{n,2} - 2. \]

Clearly $\bar{X}_n - \bar{X}_{n-1} \in \{-1, 0, 1\}$; moreover
$X_{n,1} \to \infty$ and $X_{n,2} \to \infty$ if and only if
$\bar{X}_n \to \infty$. Note that the dynamics of $(\bar{X}_n)_{n \ge 0}$ are
also both time and space inhomogeneous.

We will also use an auxiliary random-walk-like space homogeneous process

\[ S_0 = 0 \quad \text{and} \quad S_n := \sum_{i=1}^{n} Y_i, \quad \text{for } n \ge 1, \]

where $Y_1, Y_2, \dots$ are independent random variables taking values in
$\{-1, 0, 1\}$. Let the distribution of $Y_n$ on $\{-1, 0, 1\}$ be
We shall couple $(\bar{X}_n)_{n \ge 0}$ with $(S_n)_{n \ge 0}$, i.e. define
them on the same probability space $\{\Omega, \mathcal{F}, \mathbb{P}\}$, by
specifying the joint distribution of $(\bar{X}_n, S_n)_{n \ge 0}$ so that the
marginal distributions remain unchanged. We describe the details of the
construction later. Now define

\[ \Omega_{\bar{X} \ge S} := \{\omega \in \Omega : \bar{X}_n(\omega) \ge S_n(\omega) \text{ for every } n\} \tag{40} \]

and

\[ \Omega_\infty := \{\omega \in \Omega : S_n(\omega) \to \infty\}. \tag{41} \]

Clearly, if $\omega \in \Omega_{\bar{X} \ge S} \cap \Omega_\infty$, then
$\bar{X}_n(\omega) \to \infty$. In the sequel we show that for our coupling
construction

\[ \mathbb{P}(\Omega_{\bar{X} \ge S} \cap \Omega_\infty) > 0. \tag{42} \]

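We shall use Hoeffding's inequality for $S_k^{k+n} := S_{k+n} - S_k$. Since
$Y_n \in [-1, 1]$, it yields for every $t > 0$,

\[ \mathbb{P}\left(S_k^{k+n} - \mathbb{E}S_k^{k+n} \le -nt\right) \le \exp\left\{-\tfrac{1}{2} n t^2\right\}. \tag{43} \]

Note that $\mathbb{E}Y_n = 2/a_n$ and thus
$\mathbb{E}S_k^{k+n} = 2\sum_{i=k+1}^{k+n} 1/a_i$. The following choice for the
sequence $a_n$ will facilitate further calculations. Let

\[
\begin{aligned}
b_0 &= 0, \\
b_1 &= 1000, \\
b_n &= b_{n-1}\left(1 + \frac{1}{10 + \log(n)}\right), \quad \text{for } n \ge 2, \\
c_n &= \sum_{i=1}^{n} b_i, \\
a_n &= 10 + \log(k), \quad \text{for } c_{k-1} < n \le c_k.
\end{aligned}
\]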

Remark 6.1. To keep notation reasonable we ignore the fact that $b_n$ will not
be an integer. It should be clear that this does not affect the proofs, as the
constants we have defined, i.e. $b_1$ and $a_1$, are bigger than required.

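Lemma 6.2. Let $Y_n$ and $S_n$ be as defined above and let

\[ \Omega_1 := \{\omega \in \Omega : S_k = k \text{ for every } 0 < k \le c_1\}, \tag{44} \]

\[ \Omega_n := \left\{\omega \in \Omega : S_k \ge \frac{b_{n-1}}{2} \text{ for every } c_{n-1} < k \le c_n\right\} \quad \text{for } n \ge 2. \tag{45} \]

Then

\[ \mathbb{P}\left(\bigcap_{n=1}^{\infty} \Omega_n\right) > 0. \tag{46} \]

Remark 6.3. Note that $b_n \nearrow \infty$ and therefore
$\bigcap_{n=1}^{\infty} \Omega_n \subset \Omega_\infty$.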
Proof. With positive probability, say $p_{1,S}$, we have
$Y_1 = \dots = Y_{1000} = 1$, which gives $S_{c_1} = 1000 = b_1$. Hence
$\mathbb{P}(\Omega_1) = p_{1,S} > 0$. Moreover recall that $S_{c_{n-1}}^{c_n}$
is a sum of $b_n$ i.i.d. random variables with
$\mathbb{E}S_{c_{n-1}}^{c_n} = \frac{2 b_n}{10 + \log(n)}$. Therefore for every
$n \ge 1$, by Hoeffding's inequality with $t = 1/(10 + \log(n))$, we can also
write

\[ \mathbb{P}\left(S_{c_{n-1}}^{c_n} \le \frac{b_n}{10 + \log(n)}\right) \le \exp\left\{-\frac{1}{2}\,\frac{b_n}{(10 + \log(n))^2}\right\} =: p_n. \]

Therefore using the above bound iteratively we obtain

\[ \mathbb{P}\left(S_{c_1} = b_1,\ S_{c_n} \ge b_n \text{ for every } n \ge 2\right) \ge p_{1,S} \prod_{n=2}^{\infty} (1 - p_n). \tag{47} \]

Now consider the minimum of $S_k$ for $c_{n-1} < k \le c_n$ and $n \ge 2$. The
worst case is when the process $S_k$ goes monotonically down and then
monotonically up for $c_{n-1} < k \le c_n$. By the choice of $b_n$, equation
(47) implies also

\[ \mathbb{P}\left(\bigcap_{n=1}^{\infty} \Omega_n\right) \ge p_{1,S} \prod_{n=2}^{\infty} (1 - p_n). \tag{48} \]

Clearly in this case

\[ p_{1,S} \prod_{n=2}^{\infty} (1 - p_n) > 0 \iff \sum_{n=1}^{\infty} \log(1 - p_n) > -\infty \iff \sum_{n=1}^{\infty} p_n < \infty. \tag{49} \]

We conclude (49) by comparing $p_n$ with $1/n^2$. We show that there exists
$n_0$ such that for $n \ge n_0$ the series $p_n$ decreases quicker than the
series $1/n^2$ and therefore $p_n$ is summable. We check that

\[ \log\frac{p_{n-1}}{p_n} > \log\frac{n^2}{(n-1)^2}, \quad \text{for } n \ge n_0. \tag{50} \]

Indeed

\[
\begin{aligned}
\log\frac{p_{n-1}}{p_n}
&= -\frac{1}{2}\left(\frac{b_{n-1}}{(10 + \log(n-1))^2} - \frac{b_n}{(10 + \log(n))^2}\right) \\
&= \frac{b_{n-1}}{2}\left(\frac{11 + \log(n)}{(10 + \log(n))^3} - \frac{1}{(10 + \log(n-1))^2}\right) \\
&= \frac{b_{n-1}}{2}\left(\frac{(11 + \log(n))(10 + \log(n-1))^2 - (10 + \log(n))^3}{(10 + \log(n))^3 (10 + \log(n-1))^2}\right).
\end{aligned}
\]

Now recall that $b_{n-1}$ is an increasing sequence. Moreover the numerator can
be rewritten as

\[ (10 + \log(n))\left((10 + \log(n-1))^2 - (10 + \log(n))^2\right) + (10 + \log(n-1))^2; \]

now use $a^2 - b^2 = (a+b)(a-b)$ to identify the leading term
$(10 + \log(n-1))^2$. Consequently there exist a constant $C$ and
$n_0 \in \mathbb{N}$ s.t. for $n \ge n_0$

\[ \log\frac{p_{n-1}}{p_n} \ge \frac{C}{(10 + \log(n))^3} \ge \frac{2}{n-1} \ge \log\frac{n^2}{(n-1)^2}. \]

Hence $\sum_{n=1}^{\infty} p_n < \infty$ follows. □
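The sequences entering Lemma 6.2 are easy to tabulate. The sketch below is an illustration added here (function names and the cut-off `n_max = 2000` are assumptions): it computes $b_n$ from the recursion above and the Hoeffding bounds $p_n = \exp\{-b_n / (2(10+\log n)^2)\}$, then inspects the partial sum of $p_n$. A finite partial sum is of course only numerical evidence for the summability established in the proof, not a substitute for it.

```python
import math

def b_sequence(n_max):
    """b_0 = 0, b_1 = 1000, b_n = b_{n-1} (1 + 1 / (10 + log n)) for n >= 2."""
    b = [0.0, 1000.0]
    for n in range(2, n_max + 1):
        b.append(b[-1] * (1.0 + 1.0 / (10.0 + math.log(n))))
    return b

def hoeffding_bounds(b):
    """p_n = exp(-b_n / (2 (10 + log n)^2)) for n >= 2, keyed by n."""
    return {n: math.exp(-0.5 * b[n] / (10.0 + math.log(n)) ** 2)
            for n in range(2, len(b))}

if __name__ == "__main__":
    b = b_sequence(2000)
    p = hoeffding_bounds(b)
    partial_sum = sum(p.values())
    # Since prod(1 - p_n) >= 1 - sum(p_n), a small partial sum keeps the
    # product in (47)-(48) bounded away from zero over this range.
    print("b_10 =", b[10], " p_2 =", p[2], " sum p_n =", partial_sum)
```

For large $n$ the values $p_n$ underflow to zero in floating point, which is harmless here since only an upper bound on the sum is of interest.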

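Now we will describe the coupling construction of $(\bar{X}_n)_{n \ge 0}$ and
$(S_n)_{n \ge 0}$. We already remarked that
$\bigcap_{n=1}^{\infty} \Omega_n \subset \Omega_\infty$. We will define a
coupling that implies also

\[ \mathbb{P}\left(\Big(\bigcap_{n=1}^{\infty} \Omega_n\Big) \cap \Omega_{\bar{X} \ge S}\right) \ge C\, \mathbb{P}\left(\bigcap_{n=1}^{\infty} \Omega_n\right) \quad \text{for some universal } C > 0, \tag{51} \]

and therefore

\[ \mathbb{P}\left(\Omega_{\bar{X} \ge S} \cap \Omega_\infty\right) > 0. \tag{52} \]

Thus nonergodicity of $(X_n)_{n \ge 0}$ will follow from Lemma 6.2. We start
with the following observation.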

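Lemma 6.4. There exists a coupling of $\bar{X}_n - \bar{X}_{n-1}$ and $Y_n$,
such that

(a) For every $n \ge 1$ and every value of $\bar{X}_{n-1}$

\[ \mathbb{P}(\bar{X}_n - \bar{X}_{n-1} = 1,\ Y_n = 1) \ge \mathbb{P}(\bar{X}_n - \bar{X}_{n-1} = 1)\,\mathbb{P}(Y_n = 1), \tag{53} \]

(b) Write even or odd $\bar{X}_{n-1}$ as $\bar{X}_{n-1} = 2i - 2$ or
$\bar{X}_{n-1} = 2i - 3$ respectively. If $2i - 8 > a_n$ then the following
implications hold a.s.

\[ Y_n = 1 \implies \bar{X}_n - \bar{X}_{n-1} = 1 \tag{54} \]

\[ \bar{X}_n - \bar{X}_{n-1} = -1 \implies Y_n = -1. \tag{55} \]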

Proof. Property (a) is a simple fact for any two $\{-1, 0, 1\}$-valued random
variables $Z$ and $Z'$ with distributions, say, $\{d_1, d_2, d_3\}$ and
$\{d'_1, d'_2, d'_3\}$. Assign
$\mathbb{P}(Z = Z' = 1) := \min\{d_3, d'_3\}$ and (a) follows. To establish (b)
we analyse the dynamics of $(X_n)_{n \ge 0}$ and consequently of
$(\bar{X}_n)_{n \ge 0}$. Recall Algorithm 2.2 and the update rule for
$\alpha_n$. Given $X_{n-1} = (i, j)$, the algorithm will obtain the value of
$\alpha_n$ in step 1, and next draw a coordinate according to
$(\alpha_{n,1}, \alpha_{n,2})$ in step 2. In steps 3 and 4 it will move
according to conditional distributions for updating the first or the second
coordinate. These distributions are



(1/2,1/2) and 



(z-l)2 



i2 + (z-l)2'j2_^(j_l)i 



respectively. Hence given $X_{n-1} = (i, i)$ the distribution of
$X_n \in \{(i, i-1), (i, i), (i+1, i)\}$ is

14. ^2 /l 4\ *^ ^ 






whereas if $X_{n-1} = (i, i-1)$ then
$X_n \in \{(i-1, i-1), (i, i-1), (i, i)\}$ with probabilities

1 ^ ^1 ('"^)' I 4 ^ (»-l)^ \ 

respectively. We can now describe the evolution of $(\bar{X}_n)_{n \ge 0}$.
Namely, if $\bar{X}_{n-1} = 2i - 2$ then the distribution of
$\bar{X}_n - \bar{X}_{n-1} \in \{-1, 0, 1\}$ is given by (56), and if
$\bar{X}_{n-1} = 2i - 3$ then the distribution of
$\bar{X}_n - \bar{X}_{n-1} \in \{-1, 0, 1\}$ is given by (57).
Let $\le_{st}$ denote stochastic ordering. By simple algebra both measures
defined in (56) and (57) are stochastically bigger than

\[ \mu_n = (\mu_{n,1}, \mu_{n,2}, \mu_{n,3}), \tag{58} \]

where



fl 2 . 2 1 1 2i + 8-a„ 

= I ^(1 ^ ) = 1 I 1 I 2max{4,z}-8-a„ 

^"'3 M a,/^ max{4,^}^ 4 a„ 2a„ max{4, i} 

Recall $\nu_n$, the distribution of $Y_n$ defined in (39). Examine (59) and
(60) to see that if $2i - 8 > a_n$, then $\mu_n \ge_{st} \nu_n$. Hence in this
case also the distribution of $\bar{X}_n - \bar{X}_{n-1}$ is stochastically
bigger than the distribution of $Y_n$. The joint probability distribution of
$(\bar{X}_n - \bar{X}_{n-1}, Y_n)$ satisfying (54) and (55) follows. □
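Property (a) of Lemma 6.4 can be illustrated with an explicit joint distribution. The proof only prescribes $\mathbb{P}(Z = Z' = 1) = \min\{d_3, d'_3\}$; the sketch below fills in the rest of the joint law with the standard maximal coupling, which is an assumption made here for concreteness (as are the function name and the example distributions). It does not implement the monotone coupling needed for part (b).

```python
def maximal_coupling(d, d_prime):
    """Joint pmf J[i][j] = P(Z = v_i, Z' = v_j) for v = (-1, 0, 1), with
    marginals d and d_prime and diagonal mass min(d_k, d'_k)."""
    overlap = [min(a, b) for a, b in zip(d, d_prime)]
    total = sum(overlap)
    joint = [[0.0] * 3 for _ in range(3)]
    for k in range(3):
        joint[k][k] = overlap[k]
    if total < 1.0:
        # Spread the leftover mass as a product of the normalised residuals;
        # at each k at least one residual is zero, so the diagonal is unchanged.
        rest = [a - o for a, o in zip(d, overlap)]
        rest_prime = [b - o for b, o in zip(d_prime, overlap)]
        for i in range(3):
            for j in range(3):
                joint[i][j] += rest[i] * rest_prime[j] / (1.0 - total)
    return joint

if __name__ == "__main__":
    d = (0.2, 0.5, 0.3)           # hypothetical law of Z on (-1, 0, 1)
    d_prime = (0.15, 0.5, 0.35)   # hypothetical law of Z'
    joint = maximal_coupling(d, d_prime)
    print("P(Z = Z' = 1) =", joint[2][2], ">=", d[2] * d_prime[2])
```

Since $\min\{d_3, d'_3\} \ge d_3 d'_3$ for probabilities, the diagonal entry always dominates the product of the marginals, which is exactly inequality (53).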



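Proof of Proposition 3.2. Define

\[ \Omega_{1,\bar{X}} := \{\omega \in \Omega : \bar{X}_n - \bar{X}_{n-1} = 1 \text{ for every } 0 < n \le c_1\}. \tag{61} \]

Since the distribution of $\bar{X}_n - \bar{X}_{n-1}$ is stochastically bigger
than $\mu_n$ defined in (58) and $\mu_n(1) > c > 0$ for every $i$ and $n$,

\[ \mathbb{P}(\Omega_{1,\bar{X}}) =: p_{1,\bar{X}} > 0. \]

By Lemma 6.4 (a) we have

\[ \mathbb{P}(\Omega_{1,\bar{X}} \cap \Omega_1) \ge p_{1,S}\,p_{1,\bar{X}} > 0. \tag{62} \]

Since $S_{c_1} = \bar{X}_{c_1} = c_1 = b_1$ on
$\Omega_{1,\bar{X}} \cap \Omega_1$, the requirements for Lemma 6.4 (b) hold for
$n - 1 = c_1$. We shall use Lemma 6.4 (b) iteratively to keep
$\bar{X}_n \ge S_n$ for every $n$. Recall that we write $\bar{X}_{n-1}$ as
$\bar{X}_{n-1} = 2i - 2$ or $\bar{X}_{n-1} = 2i - 3$. If $2i - 8 > a_n$ and
$\bar{X}_{n-1} \ge S_{n-1}$ then by Lemma 6.4 (b) also $\bar{X}_n \ge S_n$.
Clearly if $\bar{X}_k \ge S_k$ and $S_k \ge \frac{b_{n-1}}{2}$ for
$c_{n-1} < k \le c_n$ then $\bar{X}_k \ge \frac{b_{n-1}}{2}$ for
$c_{n-1} < k \le c_n$, hence

\[ 2i - 2 \ge \frac{b_{n-1}}{2} \quad \text{for } c_{n-1} < k \le c_n. \]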
This in turn gives $2i - 8 \ge \frac{b_{n-1}}{2} - 6$ for
$c_{n-1} < k \le c_n$ and, since $a_k = 10 + \log(n)$, for the iterative
construction to hold we need $b_n \ge 32 + 2\log(n+1)$. By the definition of
$b_n$ and standard algebra we have

\[ b_n \ge 1000\left(1 + \frac{n-1}{10 + \log(n)}\right) \ge 32 + 2\log(n+1) \quad \text{for every } n \ge 1. \]

Summarising the above argument provides

\[ \mathbb{P}\left(\Omega_{\bar{X} \ge S} \cap \Omega_\infty\right) \ge p_{1,\bar{X}}\,p_{1,S} \prod_{n=2}^{\infty} (1 - p_n) > 0. \]

Hence $(X_n)_{n \ge 0}$ is not ergodic, and in particular
$\|\pi_n - \pi\|_{TV} \nrightarrow 0$. □